Introduction¶
Recommender Systems¶
Recommender systems are algorithms that are designed to recommend relevant items to users.
They are useful in cases where there are a large number of potential items or choices that could be suggested.
The objective of a recommender system is to predict the most likely items that would be of interest to the user.
Due to the sheer number of items or choices that online platforms can present,
having a way to filter, prioritise, and efficiently surface items of interest helps avoid overloading the user with choices.
Recommender systems have proven to be critical in some industries as they can boost sales while also improving customers’ experience.
Recommender systems have proven particularly useful for e-commerce, online advertising, and streaming services such as YouTube, Netflix, and Spotify.
Companies use state-of-the-art recommender systems to distinguish themselves from competitors and to improve customer retention.
There are many different methods and approaches to building a recommender system, but the underlying concept of how they all work remains fairly constant. Recommender systems search through large volumes of dynamically generated data to provide users with personalised content and services [IFO15]. The system filters the most important information for the user, including the user’s item history and interests. It then computes the similarity between the user and all items and recommends the items with the highest similarity scores. There are two major paradigms of recommender systems [Roc19]: collaborative and content-based methods.
Collaborative Recommenders¶
Collaborative recommenders rely solely on data generated by users. This approach focuses on past interactions between users and items in order to make new recommendations. These user-item interactions are stored in a so-called “user-item interactions matrix”. The primary advantage of a collaborative approach is that it requires no contextual data about users or items.
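As a toy illustration (with made-up users, items, and counts, not this dataset), an interaction log can be pivoted into such a user-item interactions matrix with pandas:

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item) interaction count
interactions = pd.DataFrame({
    'userID': [0, 0, 1, 2, 2],
    'itemID': [10, 11, 10, 11, 12],
    'count':  [5, 2, 3, 1, 4],
})

# Pivot into a user-item interactions matrix; unobserved pairs become 0
matrix = interactions.pivot_table(index='userID', columns='itemID',
                                  values='count', fill_value=0)
print(matrix)
```

In practice the matrix is extremely sparse, since each user interacts with only a small fraction of all items.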
Content Recommenders¶
Content recommenders rely on user-item interactions as well as user and/or item features. This type of system can incorporate user information such as age, sex, location, or occupation, along with any item features. Content-based methods can often explain the observed user-item interactions.
Objective¶
The objective of this project is to develop a recommender system for music artists. To do this we will use the Last.fm dataset, which can be accessed from here. The dataset contains 92,834 artist listening records from 1,892 users, as well as social networking and tagging information, from the Last.fm online music system. We want to be able to make relevant and useful recommendations based on users’ listening history.
1. Data Exploration & Pre-Processing¶
Before we begin building a recommender system we must first get a sense of the dataset. The objective of this chapter is to explore the dataset and to perform any necessary pre-processing of the data. We will use pandas to read in and explore the data. We will sanity check the data, handle inconsistencies and outliers, and make sure the data can be used to build an accurate and reliable recommender system.
1.1. Imports¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
1.2. Read in Data¶
artists = pd.read_csv('..\\data\\hetrec2011-lastfm-2k\\artists.dat', sep='\t', encoding='latin-1')
tags = pd.read_csv('..\\data\\hetrec2011-lastfm-2k\\tags.dat', sep='\t', encoding='latin-1')
user_artists = pd.read_csv('..\\data\\hetrec2011-lastfm-2k\\user_artists.dat', sep='\t', encoding='latin-1')
user_friends = pd.read_csv('..\\data\\hetrec2011-lastfm-2k\\user_friends.dat', sep='\t', encoding='latin-1')
user_tagged_artists = pd.read_csv('..\\data\\hetrec2011-lastfm-2k\\user_taggedartists.dat', sep='\t', encoding='latin-1')
1.3. Overview of the DataFrames¶
1.3.1. artists¶
This dataframe contains information about 17,632 music artists listened to and tagged by the users.
artists.head(3)
| id | name | url | pictureURL | |
|---|---|---|---|---|
| 0 | 1 | MALICE MIZER | http://www.last.fm/music/MALICE+MIZER | http://userserve-ak.last.fm/serve/252/10808.jpg |
| 1 | 2 | Diary of Dreams | http://www.last.fm/music/Diary+of+Dreams | http://userserve-ak.last.fm/serve/252/3052066.jpg |
| 2 | 3 | Carpathian Forest | http://www.last.fm/music/Carpathian+Forest | http://userserve-ak.last.fm/serve/252/40222717... |
print('number of rows: ' + str(artists.shape[0]))
pd.DataFrame([artists.nunique(),artists.isna().sum()], index=['unique_entries','null_values']).T
number of rows: 17632
| unique_entries | null_values | |
|---|---|---|
| id | 17632 | 0 |
| name | 17632 | 0 |
| url | 17632 | 0 |
| pictureURL | 17188 | 444 |
Delete pictureURL column
The column pictureURL contains links that cannot be accessed and is of no use to us, so we will delete it.
del artists['pictureURL']
Reset ids
Set id to start from 0 and to be consecutive.
artist_id_dict = pd.Series(artists.index.values,index=artists.id).to_dict()
artists.id = artists.id.map(artist_id_dict)
print('number of rows: ' + str(artists.shape[0]))
pd.DataFrame([artists.nunique(),artists.isna().sum()], index=['unique_entries','null_values']).T
number of rows: 17632
| unique_entries | null_values | |
|---|---|---|
| id | 17632 | 0 |
| name | 17632 | 0 |
| url | 17632 | 0 |
1.3.2. tags¶
This dataframe contains the set of all tags available in the dataset. Tags can be informative of the musical genre.
tags.head(3)
| tagID | tagValue | |
|---|---|---|
| 0 | 1 | metal |
| 1 | 2 | alternative metal |
| 2 | 3 | goth rock |
print('number of rows: ' + str(tags.shape[0]))
pd.DataFrame([tags.nunique(),tags.isna().sum()], index=['unique_entries','null_values']).T
number of rows: 11946
| unique_entries | null_values | |
|---|---|---|
| tagID | 11946 | 0 |
| tagValue | 11946 | 0 |
Reset ids
Similar to artist id, reset tagID to start from 0 and to be consecutive.
tags_dict = pd.Series(tags.index.values,index=tags.tagID).to_dict()
tags.tagID = tags.tagID.map(tags_dict)
1.3.3. user_artists¶
The dataframe contains the artists listened to by each user, as well as a listening count (weight) for each user, artist pair.
user_artists.head()
| userID | artistID | weight | |
|---|---|---|---|
| 0 | 2 | 51 | 13883 |
| 1 | 2 | 52 | 11690 |
| 2 | 2 | 53 | 11351 |
| 3 | 2 | 54 | 10300 |
| 4 | 2 | 55 | 8983 |
print('number of rows: ' + str(user_artists.shape[0]))
pd.DataFrame([user_artists.nunique(),user_artists.isna().sum()], index=['unique_entries','null_values']).T
number of rows: 92834
| unique_entries | null_values | |
|---|---|---|
| userID | 1892 | 0 |
| artistID | 17632 | 0 |
| weight | 5436 | 0 |
Reset userID and Map artistID
Reset userID to start from 0 and to be consecutive.
Replace artistIDs with the corresponding ids in artists.
user_id_dict = pd.Series(range(0,user_artists.userID.nunique()),index=user_artists.userID.unique()).to_dict()
user_artists.userID = user_artists.userID.map(user_id_dict)
user_artists.artistID = user_artists.artistID.map(artist_id_dict)
1.3.4. user_friends¶
This dataframe contains the friend relations between users in the dataset.
user_friends.head(3)
| userID | friendID | |
|---|---|---|
| 0 | 2 | 275 |
| 1 | 2 | 428 |
| 2 | 2 | 515 |
print('number of rows: ' + str(user_friends.shape[0]))
pd.DataFrame([user_friends.nunique(),user_friends.isna().sum()], index=['unique_entries','null_values']).T
number of rows: 25434
| unique_entries | null_values | |
|---|---|---|
| userID | 1892 | 0 |
| friendID | 1892 | 0 |
While the dataframe consists of 25,434 rows, it actually contains only half that number (12,717) of bi-directional relations: each relation is stored twice, as shown in the next cell’s output.
user_friends[((user_friends['userID'] == 2) & (user_friends['friendID'] == 275)) | ((user_friends['userID'] == 275) & (user_friends['friendID'] == 2))]
| userID | friendID | |
|---|---|---|
| 0 | 2 | 275 |
| 3837 | 275 | 2 |
Map userID and friendID
Replace both userID and friendID so they correspond to the new consecutive user ids.
user_friends.userID = user_friends.userID.map(user_id_dict)
user_friends.friendID = user_friends.friendID.map(user_id_dict)
1.3.5. user_tagged_artists¶
This dataframe contains the tag assignments of artists provided by each user and the accompanying date of when the tag assignments were done.
user_tagged_artists.head(3)
| userID | artistID | tagID | day | month | year | |
|---|---|---|---|---|---|---|
| 0 | 2 | 52 | 13 | 1 | 4 | 2009 |
| 1 | 2 | 52 | 15 | 1 | 4 | 2009 |
| 2 | 2 | 52 | 18 | 1 | 4 | 2009 |
print('number of rows: ' + str(user_tagged_artists.shape[0]))
pd.DataFrame([user_tagged_artists.nunique(),user_tagged_artists.isna().sum()], index=['unique_entries','null_values']).T
number of rows: 186479
| unique_entries | null_values | |
|---|---|---|
| userID | 1892 | 0 |
| artistID | 12523 | 0 |
| tagID | 9749 | 0 |
| day | 4 | 0 |
| month | 12 | 0 |
| year | 10 | 0 |
Interestingly, each user has tagged at least one artist, but not all artists have received a tag. The numbers of unique entries for day and year seem odd and will be further investigated in the next two cells.
user_tagged_artists.day.value_counts()
1 182948
5 1505
6 1469
9 557
Name: day, dtype: int64
Considering the size of the dataframe, it is improbable that users only ever tagged on the same four days of every month; these values are more likely indicative of the days on which the data was collected. Therefore, we will not place any significance on the day in further analysis.
user_tagged_artists.year.value_counts()
2010 54998
2009 43366
2008 40273
2007 20415
2011 15125
2006 9814
2005 2483
1956 3
1957 1
1979 1
Name: year, dtype: int64
Last.fm was founded in 2002, and the internet in 1983! Therefore, the five records with the years 1956, 1957, or 1979 are obvious errors. The dataset was compiled in 2011, so it is conceivable that the data was created sometime between 2005 and 2011.
Replace years before 2005
Replace any year before 2005 with 2005
user_tagged_artists.loc[user_tagged_artists['year'] < 2005,'year'] = 2005
Map userID, artistID and tagID
user_tagged_artists.userID = user_tagged_artists.userID.map(user_id_dict)
user_tagged_artists.artistID = user_tagged_artists.artistID.map(artist_id_dict)
user_tagged_artists.tagID = user_tagged_artists.tagID.map(tags_dict)
1.4. Data Exploration & Visualisations¶
1.4.1. Number of artists listened to by each user¶
user_artists_count = user_artists.groupby('userID').size().value_counts().rename_axis('number_of_artists').reset_index(name='counts')
plt.figure(figsize=(15,6))
ax = sns.histplot(data=user_artists_count,x='number_of_artists',weights='counts',bins=5)
ax.bar_label(ax.containers[0],c='r')
ax.set(title='Number of Artists Listened to by Users')
plt.show()
This plot is left skewed, with the vast majority of users having listened to 40-50 unique artists. Only 51 users, or 2.7% of users, have listened to fewer than 40 artists. This is reassuring as we are planning to use this dataset to build a recommender system. The more artists listened to by each user the better the recommender system should perform as the system will be better able to determine users’ musical interests and tastes.
1.4.2. Number of listeners per artist¶
listeners_count = user_artists.groupby('artistID').size().reset_index(name='number_of_listeners')
plt.figure(figsize=(15,6))
ax = sns.histplot(data=listeners_count,x='number_of_listeners',bins=10)
ax.bar_label(ax.containers[0],c='r')
ax.set(title='Number of Listeners per Artist')
plt.show()
listeners_count.number_of_listeners.describe()
count 17632.000000
mean 5.265086
std 20.620315
min 1.000000
25% 1.000000
50% 1.000000
75% 3.000000
max 611.000000
Name: number_of_listeners, dtype: float64
This right skewed plot is not ideal. It is common practice for recommender systems to discard items (in this case artists) with few interactions. However, 50% of artists in this dataset have been listened to by only a single user. If we were to exclude artists with fewer than a certain number of interactions, we would lose the majority of the dataset, so we will work with what we have.
1.4.3. Weight for each user/artist pair¶
user_artists.weight.describe()
count 92834.00000
mean 745.24393
std 3751.32208
min 1.00000
25% 107.00000
50% 260.00000
75% 614.00000
max 352698.00000
Name: weight, dtype: float64
sns.set_style('darkgrid')
sns.boxplot(x =user_artists.weight)
plt.show()
The distribution is extremely right skewed. 75% of weights are less than 614, yet the max weight is 352,698. There are extreme outliers. We believe the dataset covers 2005-2011. For a user to have listened to a certain artist 352,698 times over that period, they would have had to listen to that artist roughly 160 times every day. It is unlikely that these extreme outliers are extreme music lovers; it is more conceivable that there are errors in the data.
1.4.3.1. Determine cut-off for allowable weight values¶
To determine the maximum weight allowed, we will group users by weight bin (0-500, 500-1000, 1000-1500 …). The logic we will apply assumes that if you have listened to one artist between 500-1000 times then you have probably listened to other artists between 0-500 times. If you are a big music listener and have listened to your absolute favourite artist 2000+ times, then you have probably listened to another of your favourite artists between 1500-2000 times.
We will calculate the percentage of users in each bin which were also in the previous bin. E.g. what percentage of users who have at least one weight between 3000 and 3500, also have at least one weight between 2500 and 3000?
pc_users_in_both = []
incrs = list(range(50,len(user_artists.weight),500))
users_below = set()
for inc in incrs:
    users_in_inc = set(user_artists[(user_artists.weight > inc) & (user_artists.weight < (inc + 500))].userID)
    users_in_both = users_below.intersection(users_in_inc)
    if len(users_below) > 0:
        pc_users = len(users_in_both) / len(users_below) * 100
        pc_users_in_both.append(pc_users)
    else:
        pc_users_in_both.append(0)
    users_below = users_in_inc
sns.set_style('darkgrid')
sns.set(font_scale = 1)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
sns.lineplot(x=incrs, y=pc_users_in_both, ax=ax[0]).set_title('% of users in bin also in lower bin')
ax[0].set(xlabel='start of bin', ylabel='percentage of users also in directly lower bin',xlim=[500,80000])
sns.lineplot(x=incrs, y=pc_users_in_both, ax=ax[1]).set_title('% of users in bin also in lower bin (zoomed in)')
ax[1].set(xlabel='start of bin', ylabel='percentage of users also in directly lower bin',xlim=[500,10000])
plt.show()
As expected, the percentage of users in a bin who are also in the preceding bin initially decreases steadily, as most users reach their max weight. However, the plot quickly starts to fluctuate. To decide on the max allowable weight, we will focus on the zoomed-in graph on the right. Up until around 3,500 the graph is almost linear, so we will set the max allowable weight to 3,500. We may lose some super fans through this process, but that is better than keeping errors in the data. Each weight above 3,500 will be replaced with the median of that user’s remaining (sub-3,500) weights; if a user has no such weights, we fall back to the artist’s median weight if it is below 3,500, and otherwise to the overall median weight.
users_to_be_updated = user_artists.userID[user_artists['weight'] > 3500].unique()
users_artists_below_thres = user_artists[user_artists['weight'] < 3500]
user_new_weights_dict = round(users_artists_below_thres[users_artists_below_thres.userID.isin(list(users_to_be_updated))].groupby('userID').weight.median()).to_dict()
for index, row in user_artists.iterrows():
    if row.weight > 3500:
        try:
            # assign via .at so the change persists in the dataframe
            # (writing to the `row` copy returned by iterrows has no effect)
            user_artists.at[index, 'weight'] = user_new_weights_dict[row.userID]
        except KeyError:
            artist_median = user_artists[user_artists['artistID'] == row.artistID].weight.median()
            if artist_median < 3500:
                user_artists.at[index, 'weight'] = artist_median
            else:
                user_artists.at[index, 'weight'] = user_artists.weight.median()
sns.boxplot(x=user_artists.weight).set_title('Boxplot of weights after setting cut-off')
plt.show()
1.4.3.2. Scale the weights between 1-5¶
min_weight = user_artists.weight.min()
max_weight = user_artists.weight.max()
# np.interp is vectorised, so the whole column can be rescaled at once
user_artists.weight = np.interp(user_artists.weight, [min_weight, max_weight], [1, 5])
user_artists.describe()
| userID | artistID | weight | |
|---|---|---|---|
| count | 92834.000000 | 92834.000000 | 92834.000000 |
| mean | 944.222483 | 3235.736724 | 1.536476 |
| std | 546.751074 | 4197.216910 | 0.664117 |
| min | 0.000000 | 0.000000 | 1.000000 |
| 25% | 470.000000 | 430.000000 | 1.120034 |
| 50% | 944.000000 | 1237.000000 | 1.291512 |
| 75% | 1416.000000 | 4266.000000 | 1.667619 |
| max | 1891.000000 | 17631.000000 | 5.000000 |
1.4.4. Most popular artists¶
To find the most popular artists we will use two approaches. The first will sum all listens per artist and the second will perform a count of unique listeners per artist.
artist_pop = pd.concat([user_artists.groupby('artistID').size(), user_artists.groupby('artistID').weight.sum()], axis=1)
artist_pop.columns = ['listeners', 'listens']
artist_pop = artist_pop.reset_index()
artist_pop = pd.merge(artist_pop, artists.iloc[:, 0:2], left_on='artistID', right_on='id')
bylistens = artist_pop.sort_values('listens', ascending=False).head(10)
bylisteners = artist_pop.sort_values('listeners', ascending=False).head(10)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
fig.tight_layout()
sns.barplot(x=bylistens.listens, y=bylistens.name,palette="crest_r",ax=ax[0])
ax[0].set(xlabel='listens', ylabel='artist', title='most popular artists by number of LISTENS')
ax = sns.barplot(x=bylisteners.listeners, y=bylisteners.name,palette="rocket",ax=ax[1])
ax.set(xlabel='listeners', ylabel='artist', title='most popular artists by number of LISTENERS')
plt.subplots_adjust(wspace = 0.5)
plt.show()
Lady Gaga appears as the top artist in both charts; this reflects listening tastes during the period when the data was collected. While there are some differences between the two charts, on the whole the popular artists have both the most listens and the most listeners.
1.4.5. Most popular tags¶
Remove special characters
Special characters are causing almost identical tags to be counted as two different tags. This is an issue for tags such as 80’s and 80s. To combat this we will remove all special characters.
tag_dict = pd.Series(tags.tagValue,index=tags.tagID).to_dict()
user_tagged_artists_tag = user_tagged_artists.copy()
user_tagged_artists_tag['tag'] = user_tagged_artists_tag.tagID.map(tag_dict)
user_tagged_artists_tag.tag = user_tagged_artists_tag.tag.apply(lambda x: ''.join(char for char in x if char.isalnum()))
Group on tag
tags_ranked = user_tagged_artists_tag.groupby('tag').size().reset_index(name='counts').sort_values('counts',ascending=False)
ax = sns.barplot(x=tags_ranked.head(10).counts, y=tags_ranked.head(10).tag,palette="Greens_r")
ax.set(xlabel='count of tag usage', ylabel='tag',title='most used tags')
plt.show()
No big surprises here.
1.4.6. Can tags be used as genres?¶
From looking at the top tags, it appears that they could be used as item features representing artist genres. We want to determine how common the top 20 tags are. For each tagged artist, we will check whether at least one of their tags is also in the top 20 overall tags.
top20tags = set(tags_ranked.head(20).tag.values)
artist_tagSet = user_tagged_artists_tag.groupby('artistID').tag.agg(lambda x:set(x.unique())).reset_index(name='tagSet')
artists_with_top_tag = 0
for index, row in artist_tagSet.iterrows():
    inters = top20tags.intersection(row.tagSet)
    if len(inters) > 0:
        artists_with_top_tag += 1
pc_artists = round(artists_with_top_tag / artist_tagSet.shape[0] * 100)
print('Percentage of tagged artists with a tag in top 20 tags: ' + str(pc_artists) + '%')
Percentage of tagged artists with a tag in top 20 tags: 63%
We can make an assumption about the primary genre of 63% of tagged artists using this approach. However, as not all artists are tagged, this would actually only cover approximately 46% of all artists. Nevertheless, there are too many unique tags to one-hot-encode them all, so we will proceed with the top 20 tags.
Approach:
Compute the set of the 20 most used tags: top20tags (done)
For each artist:
Collect the set of all their tags: artist_tagSet
Find the intersection of artist_tagSet and top20tags
If there are no mutual tags, leave the artist tag as no_tag
Else if there is exactly one mutual tag, set the artist tag to that tag
Else, from the set of mutual tags, pick the one used most often for that artist
top20tags
{'80s',
'90s',
'alternative',
'alternativerock',
'ambient',
'british',
'classicrock',
'dance',
'electronic',
'experimental',
'femalevocalists',
'hardrock',
'hiphop',
'indie',
'indierock',
'metal',
'newwave',
'pop',
'rock',
'singersongwriter'}
tag_list = pd.Series(['no_tag'] * artist_tagSet.shape[0])
for index, row in artist_tagSet.iterrows():
    inters = top20tags.intersection(row.tagSet)
    if len(inters) == 1:
        tag_list[index] = str(''.join(inters))
    elif len(inters) > 1:
        # combining both filters into one boolean mask avoids the reindexing
        # warning; .index[0] after sorting picks the most frequent tag
        # (label-based .tag[0] would return an arbitrary row, not the top one)
        candidates = user_tagged_artists_tag[(user_tagged_artists_tag.artistID == row.artistID) & (user_tagged_artists_tag.tag.isin(inters))]
        tag_list[index] = candidates.groupby('tag').size().sort_values(ascending=False).index[0]
artist_tagSet['tag'] = tag_list
artist_tagSet.head(3)
| artistID | tagSet | tag | |
|---|---|---|---|
| 0 | 0.0 | {weeabo, jrock, japanese, gothic, betterthanla... | no_tag |
| 1 | 1.0 | {truegothemo, seenlive, german, gothicrock, vo... | ambient |
| 2 | 2.0 | {blackmetal, norwegianblackmetal, saxophones, ... | no_tag |
One-Hot Encoding
genres = pd.get_dummies(artist_tagSet['tag'])
artist_features = pd.concat([artists, genres], axis=1).fillna(0)
del artist_features['url']
artist_features.head(3)
| id | name | 80s | 90s | alternative | alternativerock | ambient | british | classicrock | dance | ... | hardrock | hiphop | indie | indierock | metal | newwave | no_tag | pop | rock | singersongwriter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MALICE MIZER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | Diary of Dreams | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | Carpathian Forest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
3 rows × 23 columns
1.4.7. How similar are friends’ music tastes?¶
This is to determine how useful the friends relationships are. We will calculate the percentage of artists that friends have in common.
Remove repeated relationships
user_friends = user_friends[pd.DataFrame(np.sort(user_friends.values), columns=user_friends.columns, index=user_friends.index).duplicated(keep='last')]
Percentage of artists in common between a pair of friends = \(\frac{|\text{Friend1 artists} \,\cap\, \text{Friend2 artists}|}{|\text{Friend1 artists} \,\cup\, \text{Friend2 artists}|}\times 100\)
artists_incommon = []
for index, row in user_friends.iterrows():
    friend1 = set(user_artists[user_artists.userID == row.userID].artistID)
    friend2 = set(user_artists[user_artists.userID == row.friendID].artistID)
    incommon = friend1.intersection(friend2)
    total = friend1.union(friend2)
    pc_incommon = len(incommon) / len(total) * 100
    artists_incommon.append(pc_incommon)
print("Mean percentage of artists in common: %.2f%%" % np.mean(np.array(artists_incommon)))
Mean percentage of artists in common: 10.26%
1.5. Create DataFrames for Recommendation Building¶
1.5.1. Artist Information¶
Let’s create a dataframe that contains information on each artist: artist id, name, their 3 most common tags, a list of all their tags, the year in which they were tagged most often, and a count of how many users listened to them.
# Function to get the n-th most frequent tag for an artist;
# if the artist has fewer than n distinct tags, fall back to their top tag
def get_top_n(tag_list, n):
    try:
        genre_n = tag_list.value_counts().index[n-1]
    except IndexError:
        genre_n = tag_list.value_counts().index[0]
    return genre_n
top_tag = user_tagged_artists_tag.groupby('artistID').tag.agg(lambda x:get_top_n(x, 1))
sec_tag = user_tagged_artists_tag.groupby('artistID').tag.agg(lambda x:get_top_n(x, 2))
third_tag = user_tagged_artists_tag.groupby('artistID').tag.agg(lambda x:get_top_n(x, 3))
all_tags = user_tagged_artists_tag.groupby('artistID').tag.agg(lambda x:x.unique().astype('str').tolist())
peak_year = user_tagged_artists.groupby('artistID').year.agg(lambda x:x.mode()[0])  # chooses earliest year if there's a draw
artists_df = pd.concat([top_tag,sec_tag,third_tag,all_tags,peak_year],axis=1,keys=['tag_1', 'tag_2','tag_3','all_tags','peak_year']).reset_index()
artists_df.artistID = artists_df.artistID.astype('int')
artists_df = artists.merge(artists_df, right_on='artistID', left_on='id', how='left').drop(['url','artistID'], axis = 1)
listens_count = user_artists.groupby('artistID').size().to_frame('listen_count')
artists_df = pd.concat([artists_df, listens_count],axis=1)
artists_df.name = artists_df.name.astype('str')
artists_df.peak_year = artists_df.peak_year.fillna(2004).astype('int')
artists_df.tag_1 = artists_df.tag_1.fillna('no_tags')
artists_df.all_tags = artists_df.all_tags.fillna('[]')
artist_features['peak_year'] = peak_year
artist_features['all_tags'] = all_tags
artists_df.head(3)
| id | name | tag_1 | tag_2 | tag_3 | all_tags | peak_year | listen_count | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MALICE MIZER | jrock | visualkei | gothic | [weeabo, jrock, visualkei, betterthanladygaga,... | 2008 | 3 |
| 1 | 1 | Diary of Dreams | darkwave | german | gothic | [german, seenlive, darkwave, industrial, gothi... | 2009 | 12 |
| 2 | 2 | Carpathian Forest | blackmetal | truenorwegianblackmetal | norwegianblackmetal | [blackmetal, norwegianblackmetal, truenorwegia... | 2008 | 3 |
1.5.2. Listens Per Artist¶
A dataframe containing userID, artistID, and the scaled listenCount.
listens = user_artists.rename(columns={'weight': 'listenCount'})
listens.head(3)
| userID | artistID | listenCount | |
|---|---|---|---|
| 0 | 0 | 45 | 3.047442 |
| 1 | 0 | 46 | 3.047442 |
| 2 | 0 | 47 | 3.047442 |
1.5.3. Save Files¶
artists.to_csv('..\\data\\processed\\artists.csv')
tags.to_csv('..\\data\\processed\\tags.csv')
user_artists.to_csv('..\\data\\processed\\user_artists.csv')
user_friends.to_csv('..\\data\\processed\\user_friends.csv')
user_tagged_artists.to_csv('..\\data\\processed\\user_tagged_artists.csv')
artists_df.to_csv('..\\data\\processed\\artist_info.csv')
listens.to_csv('..\\data\\processed\\listens.csv')
artist_features.to_csv('..\\data\\processed\\artist_features.csv')
2. Matrix Factorization¶
In this chapter we will build a recommendation system using matrix factorisation and TensorFlow. Matrix factorisation is a form of collaborative filtering which decomposes the user-item interaction matrix into the product of two lower-dimensional matrices. Recommendations can often be improved by assigning regularization weights based on items’ popularity and users’ engagement levels.
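As a minimal sketch of the idea (plain NumPy on a made-up interaction matrix, not the TensorFlow model built in this chapter), we can factorise a matrix by gradient descent on the squared error over the observed entries only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interaction matrix: 4 users x 5 items, 0 means "unobserved"
A = np.array([[5., 3., 0., 1., 0.],
              [4., 0., 0., 1., 0.],
              [1., 1., 0., 5., 0.],
              [0., 0., 5., 4., 0.]])
observed = A > 0

d = 2                                     # embedding dimension
U = rng.normal(scale=0.1, size=(4, d))    # user embeddings
V = rng.normal(scale=0.1, size=(5, d))    # item embeddings
lr = 0.02

# Full-batch gradient descent on 0.5 * sum of squared residuals
# over observed entries; unobserved cells are masked out
for _ in range(5000):
    E = (A - U @ V.T) * observed          # residuals on observed cells only
    U_grad = E @ V
    V_grad = E.T @ U
    U += lr * U_grad
    V += lr * V_grad

mse = np.mean((A - U @ V.T)[observed] ** 2)
```

With more parameters than observed entries, this toy model fits the observed cells almost exactly; the point of the low-rank structure is that the product U @ V.T also produces predictions for the unobserved cells.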
The work done in this chapter involves reproducing and adapting the work done in this Google Colab for our cleaned Last.fm dataset.
2.1. Outline¶
Preliminaries
Training matrix factorization model
Inspect the Embeddings
Regularization in matrix factorization
2.2. Import required packages¶
import numpy as np
import pandas as pd
import collections
from ast import literal_eval
from mpl_toolkits.mplot3d import Axes3D
from IPython import display
from matplotlib import pyplot as plt
import sklearn
import sklearn.manifold
import tensorflow.compat.v1 as tf
# tf.disable_v2_behavior()
# tf.logging.set_verbosity(tf.logging.ERROR)
tf.compat.v1.disable_eager_execution()
# Add some convenience functions to Pandas DataFrame.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.3f}'.format
def mask(df, key, function):
    """Returns a filtered dataframe, by applying function to key"""
    return df[function(df[key])]

def flatten_cols(df):
    df.columns = [' '.join(col).strip() for col in df.columns.values]
    return df

pd.DataFrame.mask = mask
pd.DataFrame.flatten_cols = flatten_cols
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
alt.renderers.enable('html')
RendererRegistry.enable('html')
2.3. Import Data¶
listens = pd.read_csv('..\\data\\processed\\listens.csv', index_col=0,encoding='utf-8')
artists_df = pd.read_csv('..\\data\\processed\\artist_info.csv', index_col=0,encoding='utf-8')
artists = pd.read_csv('..\\data\\processed\\artists.csv', index_col=0,encoding='utf-8')
artists_df.id = artists_df.id.astype('str')
artists_df.peak_year = artists_df.peak_year.astype('str')
artists_df.all_tags = artists_df.all_tags.apply(lambda x: literal_eval(x))
artists.id = artists.id.astype('str')
listens.userID = listens.userID.astype('str')
listens.artistID = listens.artistID.astype('str')
2.4. I. Preliminaries¶
Our goal is to factorize the listens matrix \(A\) into the product of a user embedding matrix \(U\) and an artist embedding matrix \(V\), such that
\(A \approx UV^\top\) with
\(U = \begin{bmatrix} u_{1} \\ \hline \vdots \\ \hline u_{N} \end{bmatrix}\) and
\(V = \begin{bmatrix} v_{1} \\ \hline \vdots \\ \hline v_{M} \end{bmatrix}\).
Here
\(N\) is the number of users,
\(M\) is the number of artists,
\(A_{ij}\) is the listening count of the \(j\)th artist by the \(i\)th user,
each row \(U_i\) is a \(d\)-dimensional vector (embedding) representing user \(i\),
each row \(V_j\) is a \(d\)-dimensional vector (embedding) representing artist \(j\),
the prediction of the model for the \((i, j)\) pair is the dot product \(\langle U_i, V_j \rangle\).
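As a quick NumPy illustration of this prediction rule (toy shapes and random embeddings, not the real dataset):

```python
import numpy as np

N, M, d = 3, 4, 2                  # users, artists, embedding dimension
rng = np.random.default_rng(1)
U = rng.normal(size=(N, d))        # user embeddings, one row per user
V = rng.normal(size=(M, d))        # artist embeddings, one row per artist

# The full prediction matrix is U @ V.T, of shape (N, M);
# the (i, j) entry equals the dot product <U_i, V_j>
preds = U @ V.T
i, j = 1, 3
assert np.isclose(preds[i, j], U[i] @ V[j])
```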
2.4.1. Sparse Representation of the Rating Matrix¶
In general, most of the entries are unobserved, since a given user will only listen to a small subset of artists. For efficient representation, we will use a tf.SparseTensor. A SparseTensor uses three tensors to represent the matrix: tf.SparseTensor(indices, values, dense_shape) represents a tensor, where a value \(A_{ij} = a\) is encoded by setting indices[k] = [i, j] and values[k] = a. The last tensor dense_shape is used to specify the shape of the full underlying matrix.
Our dataset contains 1,892 users and 17,632 artists. Therefore, the dense_shape will be set to [1892,17632].
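To make the encoding concrete, here is a toy version of the (indices, values, dense_shape) triple built with plain NumPy (the actual code uses tf.SparseTensor):

```python
import numpy as np

# Dense 3x4 matrix with only three observed entries
A = np.zeros((3, 4))
A[0, 1] = 2.0
A[1, 3] = 5.0
A[2, 0] = 1.0

# Sparse encoding: indices[k] = [i, j] and values[k] = A[i, j];
# both scans are row-major, so the two arrays line up
indices = np.argwhere(A != 0)      # [[0, 1], [1, 3], [2, 0]]
values = A[A != 0]                 # [2., 5., 1.]
dense_shape = A.shape              # (3, 4)

# Reconstruct the dense matrix from the sparse triple
B = np.zeros(dense_shape)
B[indices[:, 0], indices[:, 1]] = values
```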
# Function that maps from a listens DataFrame to a tf.SparseTensor.
def build_listens_sparse_tensor(listens_df):
    """
    Args:
        listens_df: a pd.DataFrame with `userID`, `artistID` and `listenCount` columns.
    Returns:
        a tf.SparseTensor representing the listens matrix.
    """
    indices = listens_df[['userID', 'artistID']].values
    values = listens_df['listenCount'].values
    return tf.SparseTensor(
        indices=indices,
        values=values,
        dense_shape=[listens.userID.nunique(), listens.artistID.nunique()])
2.4.2. Calculating the Error¶
The model approximates the ratings matrix \(A\) by a low-rank product \(UV^\top\). We need a way to measure the approximation error. We’ll use the Mean Squared Error of observed entries only. It is defined as
\(\text{MSE}(A, UV^\top) = \frac{1}{|\Omega|}\sum_{(i, j) \in \Omega}\left(A_{ij} - \langle U_i, V_j \rangle\right)^2\)
where \(\Omega\) is the set of observed ratings, and \(|\Omega|\) is the cardinality of \(\Omega\).
The following function takes a sparse listens matrix \(A\) and the two embedding matrices \(U, V\), and returns the mean squared error \(\text{MSE}(A, UV^\top)\).
def sparse_mean_square_error(sparse_listens, user_embeddings, artist_embeddings):
"""
Args:
sparse_listens: A SparseTensor listens matrix, of dense_shape [N, M]
user_embeddings: A dense Tensor U of shape [N, k] where k is the embedding
dimension, such that U_i is the embedding of user i.
artist_embeddings: A dense Tensor V of shape [M, k] where k is the embedding
dimension, such that V_j is the embedding of artist j.
Returns:
A scalar Tensor representing the MSE between the true ratings and the
model's predictions.
"""
predictions = tf.gather_nd(
tf.matmul(user_embeddings, artist_embeddings, transpose_b=True),
sparse_listens.indices)
loss = tf.losses.mean_squared_error(sparse_listens.values, predictions)
return loss
This computes the full prediction matrix \(UV^\top\), then gathers the entries corresponding to the observed pairs. The memory cost of this approach is \(O(NM)\). For the Last.fm dataset this is fine, as the dense \(N \times M\) matrix is small enough to fit in memory (\(N = 1892\), \(M = 17632\)).
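As a sanity check on the definition, the observed-entries MSE can be computed in a few lines of NumPy (toy matrices, invented for the example, not the Last.fm data):

```python
import numpy as np

U = np.array([[1.0, 0.0],
              [0.0, 2.0]])      # 2 users, embedding dim 2
V = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [0.0, 1.0]])      # 3 artists

# Observed entries Omega, as (i, j, A_ij) triples.
observed = [(0, 0, 1.0), (0, 2, 0.5), (1, 1, 1.0)]

preds = np.array([U[i] @ V[j] for i, j, _ in observed])
truth = np.array([a for _, _, a in observed])
mse = np.mean((truth - preds) ** 2)   # averages over |Omega| = 3 entries only
```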
2.5. II. Training the Matrix Factorization model¶
2.5.1. CFModel (Collaborative Filtering Model) helper class¶
This is a simple class to train a matrix factorization model using stochastic gradient descent.
The class constructor takes
the user embeddings U (a tf.Variable),
the artist embeddings V (a tf.Variable),
a loss to optimize (a tf.Tensor),
an optional list of metrics dictionaries, each mapping a string (the name of the metric) to a tensor. These are evaluated and plotted during training (e.g. training error and test error).
2.5.1.1. CFModel (Collaborative Filtering Model)¶
class CFModel(object):
"""Simple class that represents a collaborative filtering model"""
def __init__(self, embedding_vars, loss, metrics=None):
"""Initializes a CFModel.
Args:
embedding_vars: A dictionary of tf.Variables.
loss: A float Tensor. The loss to optimize.
metrics: optional list of dictionaries of Tensors. The metrics in each
dictionary will be plotted in a separate figure during training.
"""
self._embedding_vars = embedding_vars
self._loss = loss
self._metrics = metrics
self._embeddings = {k: None for k in embedding_vars}
self._session = None
@property
def embeddings(self):
"""The embeddings dictionary."""
return self._embeddings
def train(self, num_iterations=10, learning_rate=1.0, plot_results=True,
optimizer=tf.train.GradientDescentOptimizer):
"""Trains the model.
Args:
num_iterations: number of iterations to run.
learning_rate: optimizer learning rate.
plot_results: whether to plot the results at the end of training.
optimizer: the optimizer to use. Default to GradientDescentOptimizer.
Returns:
The metrics dictionary evaluated at the last iteration.
"""
with self._loss.graph.as_default():
opt = optimizer(learning_rate)
train_op = opt.minimize(self._loss)
local_init_op = tf.group(
tf.variables_initializer(opt.variables()),
tf.local_variables_initializer())
if self._session is None:
self._session = tf.Session()
with self._session.as_default():
self._session.run(tf.global_variables_initializer())
self._session.run(tf.tables_initializer())
# tf.train.start_queue_runners()
with self._session.as_default():
local_init_op.run()
iterations = []
metrics = self._metrics or ({},)
metrics_vals = [collections.defaultdict(list) for _ in self._metrics]
# Train and append results.
for i in range(num_iterations + 1):
_, results = self._session.run((train_op, metrics))
if (i % 10 == 0) or i == num_iterations:
print("\r iteration %d: " % i + ", ".join(
["%s=%f" % (k, v) for r in results for k, v in r.items()]),
end='')
iterations.append(i)
for metric_val, result in zip(metrics_vals, results):
for k, v in result.items():
metric_val[k].append(v)
for k, v in self._embedding_vars.items():
self._embeddings[k] = v.eval()
if plot_results:
# Plot the metrics.
num_subplots = len(metrics)+1
fig = plt.figure()
fig.set_size_inches(num_subplots*10, 8)
for i, metric_vals in enumerate(metrics_vals):
ax = fig.add_subplot(1, num_subplots, i+1)
for k, v in metric_vals.items():
ax.plot(iterations, v, label=k)
ax.set_xlim([1, num_iterations])
ax.legend()
return results
2.5.1.2. Matrix Factorization model¶
# Utility to split the data into training and test sets.
def split_dataframe(df, holdout_fraction=0.1):
"""Splits a DataFrame into training and test sets.
Args:
df: a dataframe.
holdout_fraction: fraction of dataframe rows to use in the test set.
Returns:
train: dataframe for training
test: dataframe for testing
"""
test = df.sample(frac=holdout_fraction, replace=False)
train = df[~df.index.isin(test.index)]
return train, test
def build_model(listens, embedding_dim=3, init_stddev=1.):
"""
Args:
listens: a DataFrame of the listen counts
embedding_dim: the dimension of the embedding vectors.
init_stddev: float, the standard deviation of the random initial embeddings.
Returns:
model: a CFModel.
"""
# Split the listens DataFrame into train and test.
train_listens, test_listens = split_dataframe(listens)
# SparseTensor representation of the train and test datasets.
A_train = build_listens_sparse_tensor(train_listens)
A_test = build_listens_sparse_tensor(test_listens)
# Initialize the embeddings using a normal distribution.
U = tf.Variable(tf.random_normal(
[A_train.dense_shape[0], embedding_dim], stddev=init_stddev))
V = tf.Variable(tf.random_normal(
[A_train.dense_shape[1], embedding_dim], stddev=init_stddev))
train_loss = sparse_mean_square_error(A_train, U, V)
test_loss = sparse_mean_square_error(A_test, U, V)
metrics = {
'train_error': train_loss,
'test_error': test_loss
}
embeddings = {
"userID": U,
"artistID": V
}
return CFModel(embeddings, train_loss, [metrics])
2.5.1.3. Train the Matrix Factorization model¶
# Build the CF model and train it.
model = build_model(listens, embedding_dim=30, init_stddev=0.5)
model.train(num_iterations=1000, learning_rate=10.)
iteration 1000: train_error=0.105881, test_error=2.284283
[{'train_error': 0.1058807, 'test_error': 2.2842832}]
A sharp drop in the training error is observed. The test error, as expected, shows a less pronounced drop and quickly plateaus at around 2.3.
2.6. III. Inspecting the Embeddings¶
In this section, we take a closer look at the learned embeddings, by
computing recommendations
looking at the nearest neighbors of some artists,
looking at the norms of the artist embeddings,
visualizing the embedding in a projected embedding space.
2.6.1. Function to compute the scores of the candidates¶
We start by writing a function that, given a query embedding \(u \in \mathbb R^d\) and item embeddings \(V \in \mathbb R^{N \times d}\), computes the item scores.
There are different similarity measures we can use, and these can yield different results. We will compare the following:
dot product: the score of item j is \(\langle u, V_j \rangle\).
cosine: the score of item j is \(\frac{\langle u, V_j \rangle}{\|u\|\|V_j\|}\).
DOT = 'dot'
COSINE = 'cosine'
def compute_scores(query_embedding, item_embeddings, measure=DOT):
"""Computes the scores of the candidates given a query.
Args:
query_embedding: a vector of shape [k], representing the query embedding.
item_embeddings: a matrix of shape [N, k], such that row i is the embedding
of item i.
measure: a string specifying the similarity measure to be used. Can be
either DOT or COSINE.
Returns:
scores: a vector of shape [N], such that scores[i] is the score of item i.
"""
u = query_embedding
V = item_embeddings
if measure == COSINE:
V = V / np.linalg.norm(V, axis=1, keepdims=True)
u = u / np.linalg.norm(u)
scores = u.dot(V.T)
return scores
Equipped with this function, we can compute recommendations, where the query embedding can be either a user embedding or an artist embedding.
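A tiny usage sketch (embedding values invented for illustration) shows how the two measures can rank the same items differently:

```python
import numpy as np

u = np.array([1.0, 0.0])          # query embedding
V = np.array([[3.0, 0.1],         # item 0: large norm, slightly off-direction
              [0.5, 0.0]])        # item 1: small norm, exactly along u

dot = V @ u                       # dot-product scores
cos = (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u))

# Dot product favours the large-norm item 0...
assert dot[0] > dot[1]
# ...while cosine favours item 1, which points exactly along u.
assert cos[1] > cos[0]
```

This is the same effect discussed below: dot-product scores are sensitive to embedding norms (and hence, in this model, to popularity), while cosine scores only compare directions.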
2.6.2. Artist Nearest Neighbors¶
def artist_neighbors(model, artist_substring, measure=DOT, k=6):
# Search for artist ids that match the given substring.
ids = artists[artists['name'].str.contains(artist_substring)].index.values
names = artists.iloc[ids]['name'].values
if len(names) == 0:
raise ValueError("Found no artists with name %s" % artist_substring)
print("Nearest neighbors of : %s." % names[0])
if len(names) > 1:
print("[Found more than one matching artist. Other candidates: {}]".format(
", ".join(names[1:])))
artist_id = ids[0]
scores = compute_scores(
model.embeddings["artistID"][artist_id], model.embeddings["artistID"],
measure)
score_key = measure + ' score'
df = pd.DataFrame({
score_key: list(scores),
'names': artists['name'],
})
display.display(df.sort_values([score_key], ascending=False).head(k))
artist_neighbors(model, 'Coldplay', DOT)
artist_neighbors(model, 'Coldplay', COSINE)
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| dot score | names | |
|---|---|---|
| 167 | 3.899 | Placebo |
| 59 | 3.862 | Coldplay |
| 148 | 3.771 | Radiohead |
| 10156 | 3.720 | The Heart Attacks |
| 9079 | 3.694 | Chris Lambert |
| 698 | 3.668 | The Pretty Reckless |
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| cosine score | names | |
|---|---|---|
| 59 | 1.000 | Coldplay |
| 184 | 0.931 | Muse |
| 1391 | 0.928 | MGMT |
| 222 | 0.919 | Kings of Leon |
| 214 | 0.909 | Red Hot Chili Peppers |
| 371 | 0.907 | Linkin Park |
These recommendations of artists similar to Coldplay are decent: most of them are bands, primarily rock bands. However, both similarity measures also surface some low-popularity artists, such as The Heart Attacks and Chris Lambert, whose high scores likely reflect incidental overlap in listening patterns rather than genuine similarity.
2.6.3. Artist Embedding Norm¶
With dot-product, the model tends to recommend popular artists. This can be explained by the fact that in matrix factorization models, the norm of the embedding is often correlated with popularity (popular artists have a larger norm), which makes it more likely to recommend more popular items. We can confirm this hypothesis by sorting the artists by their embedding norm, as done in the next cell.
def artist_embedding_norm(models):
"""Visualizes the norm and number of ratings of the artist embeddings.
Args:
models: A MFModel object, or a list of MFModel objects.
"""
if not isinstance(models, list):
models = [models]
df = pd.DataFrame({
'name': artists_df['name'],
'tag': artists_df['tag_1'],
'listen_count': artists_df['listen_count'],
})
charts = []
brush = alt.selection_interval()
for i, model in enumerate(models):
norm_key = 'norm'+str(i)
df[norm_key] = np.linalg.norm(model.embeddings["artistID"], axis=1)
nearest = alt.selection(
type='single', encodings=['x', 'y'], on='mouseover', nearest=True,
empty='none')
base = alt.Chart().mark_circle().encode(
x='listen_count',
y=norm_key,
color=alt.condition(brush, alt.value('#4c78a8'), alt.value('lightgray'))
).properties(
selection=nearest).add_selection(brush)
text = alt.Chart().mark_text(align='center', dx=5, dy=-5).encode(
x='listen_count', y=norm_key,
text=alt.condition(nearest, 'name', alt.value('')))
charts.append(alt.layer(base, text))
return alt.hconcat(*charts, data=df)
def visualize_artist_embeddings(data, x, y):
nearest = alt.selection(
type='single', encodings=['x', 'y'], on='mouseover', nearest=True,
empty='none')
base = alt.Chart().mark_circle().encode(
x=x,
y=y,
# color=alt.condition(genre_filter, "tag", alt.value("whitesmoke")),
).properties(
width=600,
height=600,
selection=nearest)
text = alt.Chart().mark_text(align='left', dx=5, dy=-5).encode(
x=x,
y=y,
text=alt.condition(nearest, 'name', alt.value('')))
return alt.hconcat(alt.layer(base, text), data=data)
def tsne_artist_embeddings(model):
"""Visualizes the artist embeddings, projected using t-SNE with Cosine measure.
Args:
model: A MFModel object.
"""
tsne = sklearn.manifold.TSNE(
n_components=2, perplexity=40, metric='cosine', early_exaggeration=10.0,
init='pca', verbose=True, n_iter=400)
print('Running t-SNE...')
V_proj = tsne.fit_transform(model.embeddings["artistID"])
artists_df.loc[:,'x'] = V_proj[:, 0]
artists_df.loc[:,'y'] = V_proj[:, 1]
return visualize_artist_embeddings(artists_df, 'x', 'y')
artist_embedding_norm(model)
model_lowinit = build_model(listens, embedding_dim=30, init_stddev=0.05)
model_lowinit.train(num_iterations=1000, learning_rate=10.)
artist_embedding_norm([model, model_lowinit])
iteration 1000: train_error=0.228160, test_error=0.625372
Lady Gaga, the most popular artist in the dataset, is the dot furthest to the right. Hovering over the dots on the right side of the visualisation, I recognise almost all of them; I do not recognise most artists on the left. The norm of the embedding is now correlated with popularity. This has greatly reduced the test error, to about 0.6. The model is improving.
artist_neighbors(model_lowinit, "Coldplay", DOT)
artist_neighbors(model_lowinit, "Coldplay", COSINE)
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| dot score | names | |
|---|---|---|
| 59 | 3.535 | Coldplay |
| 184 | 2.294 | Muse |
| 221 | 2.103 | The Beatles |
| 223 | 1.998 | The Killers |
| 222 | 1.980 | Kings of Leon |
| 1089 | 1.897 | Björk |
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| cosine score | names | |
|---|---|---|
| 59 | 1.000 | Coldplay |
| 9010 | 0.754 | Gufi |
| 486 | 0.746 | Funeral for a Friend |
| 7182 | 0.738 | Relespública |
| 7484 | 0.738 | Tocotronic |
| 1961 | 0.724 | The Black Keys |
The relevance of the recommendations made using the dot product has correspondingly increased.
2.6.4. Embedding visualization¶
Since it is hard to visualize embeddings in a higher-dimensional space (when the embedding dimension \(k > 3\)), one approach is to project the embeddings to a lower-dimensional space. t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm that projects the embeddings while attempting to preserve their pairwise distances.
tsne_artist_embeddings(model_lowinit)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 17632 samples in 0.001s...
[t-SNE] Computed neighbors for 17632 samples in 11.678s...
[t-SNE] Computed conditional probabilities for sample 17632 / 17632
[t-SNE] Mean sigma: 0.146395
[t-SNE] KL divergence after 100 iterations with early exaggeration: 79.401367
[t-SNE] KL divergence after 400 iterations: 4.176984
There is not much structure to this graph.
2.7. IV. Regularization In Matrix Factorization¶
In the previous section, our loss was defined as the mean squared error on the observed part of the rating matrix. This can be problematic as the model does not learn how to place the embeddings of irrelevant artists. This phenomenon is known as folding.
We will add regularization terms that will address this issue. We will use two types of regularization:
Regularization of the model parameters. This is a common \(\ell_2\) regularization term on the embedding matrices, given by
\[
r(U, V) = \frac{1}{N} \sum_i \|U_i\|^2 + \frac{1}{M} \sum_j \|V_j\|^2.
\]
A global prior that pushes the prediction of any pair towards zero, called the gravity term. This is given by
\[
g(U, V) = \frac{1}{MN} \sum_{i=1}^N \sum_{j=1}^M \langle U_i, V_j \rangle^2.
\]
The total loss is then given by
\[
\frac{1}{|\Omega|} \sum_{(i, j) \in \Omega} \left( A_{ij} - \langle U_i, V_j \rangle \right)^2 + \lambda_r \, r(U, V) + \lambda_g \, g(U, V),
\]
where \(\lambda_r\) and \(\lambda_g\) are two regularization coefficients (hyper-parameters).
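The gravity term can be evaluated cheaply thanks to the identity \(\sum_{i,j} \langle U_i, V_j \rangle^2 = \sum_{a,b} (U^\top U)_{ab} (V^\top V)_{ab}\), so only two \(d \times d\) Gram matrices are needed rather than the full \(N \times M\) prediction matrix. A quick NumPy check of this identity (toy shapes, invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, d = 6, 8, 3
U = rng.normal(size=(N, d))
V = rng.normal(size=(M, d))

# Naive form: sum of squared dot products over all (i, j) pairs.
naive = np.sum((U @ V.T) ** 2) / (N * M)

# Gram-matrix form: elementwise product of the two d x d Gram matrices,
# which never materializes the N x M matrix.
gram = np.sum((U.T @ U) * (V.T @ V)) / (N * M)

assert np.isclose(naive, gram)
```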
2.7.1. Build a regularized Matrix Factorization model and train it¶
def gravity(U, V):
"""Creates a gravity loss given two embedding matrices."""
return 1. / (U.shape[0]*V.shape[0]) * tf.reduce_sum(
tf.matmul(U, U, transpose_a=True) * tf.matmul(V, V, transpose_a=True))
def build_regularized_model(
ratings, embedding_dim=3, regularization_coeff=.1, gravity_coeff=1.,
init_stddev=0.1):
"""
Args:
ratings: the DataFrame of artist listen counts.
embedding_dim: The dimension of the embedding space.
regularization_coeff: The regularization coefficient lambda_r.
gravity_coeff: The gravity regularization coefficient lambda_g.
init_stddev: float, the standard deviation of the random initial embeddings.
Returns:
A CFModel object that uses a regularized loss.
"""
# Split the ratings DataFrame into train and test.
train_ratings, test_ratings = split_dataframe(ratings)
# SparseTensor representation of the train and test datasets.
A_train = build_listens_sparse_tensor(train_ratings)
A_test = build_listens_sparse_tensor(test_ratings)
U = tf.Variable(tf.random_normal(
[A_train.dense_shape[0], embedding_dim], stddev=init_stddev))
V = tf.Variable(tf.random_normal(
[A_train.dense_shape[1], embedding_dim], stddev=init_stddev))
error_train = sparse_mean_square_error(A_train, U, V)
error_test = sparse_mean_square_error(A_test, U, V)
gravity_loss = gravity_coeff * gravity(U, V)
regularization_loss = regularization_coeff * (
tf.reduce_sum(U*U)/U.shape[0] + tf.reduce_sum(V*V)/V.shape[0])
total_loss = error_train + regularization_loss + gravity_loss
losses = {
'train_error_observed': error_train,
'test_error_observed': error_test,
}
loss_components = {
'observed_loss': error_train,
'regularization_loss': regularization_loss,
'gravity_loss': gravity_loss,
}
embeddings = {"userID": U, "artistID": V}
return CFModel(embeddings, total_loss, [losses, loss_components])
reg_model = build_regularized_model(
listens, regularization_coeff=0.1, gravity_coeff=1.0, embedding_dim=35,
init_stddev=.05)
reg_model.train(num_iterations=2000, learning_rate=20.)
iteration 2000: train_error_observed=0.123008, test_error_observed=0.829953, observed_loss=0.123008, regularization_loss=0.255787, gravity_loss=0.093286
[{'train_error_observed': 0.12300799, 'test_error_observed': 0.8299535},
{'observed_loss': 0.12300799,
'regularization_loss': 0.25578678,
'gravity_loss': 0.093285955}]
Adding the regularization terms results in a slightly higher MSE on the training set, but considerably lowers the MSE for the test set. This trade-off is worthwhile as it will ultimately result in better recommendations.
In the following cells, we display the nearest neighbors, the embedding norms, and the t-SNE projection of the artist embeddings.
artist_neighbors(reg_model, "Coldplay", DOT)
artist_neighbors(reg_model, "Coldplay", COSINE)
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| dot score | names | |
|---|---|---|
| 59 | 15.723 | Coldplay |
| 3037 | 8.765 | Regina Spektor |
| 201 | 8.536 | Arctic Monkeys |
| 224 | 8.522 | Green Day |
| 505 | 8.113 | U2 |
| 148 | 8.056 | Radiohead |
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| cosine score | names | |
|---|---|---|
| 59 | 1.000 | Coldplay |
| 8927 | 0.793 | Mark Ronson |
| 10630 | 0.776 | Jam & Spoon |
| 10632 | 0.762 | 7 Days Away |
| 10629 | 0.761 | Missing Hours |
| 3655 | 0.748 | Everything Everything |
The dot-score recommendations seem to have improved further, apart from the unusual inclusion of Regina Spektor; the rest are highly relevant. The cosine-similarity recommendations, however, appear to have become less useful: they now surface artists with very low popularity.
Here we compare the embedding norms for model and reg_model.
artist_embedding_norm([model, model_lowinit, reg_model])
The embedding norms for reg_model now follow a nice curve: listen_count is clearly correlated with the embedding norm.
# Visualize the embeddings
tsne_artist_embeddings(reg_model)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 17632 samples in 0.016s...
[t-SNE] Computed neighbors for 17632 samples in 25.751s...
[t-SNE] Computed conditional probabilities for sample 17632 / 17632
[t-SNE] Mean sigma: 0.242763
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.572250
[t-SNE] KL divergence after 400 iterations: 2.660660
The embeddings have somewhat more structure than in the unregularized case. There appear to be two fairly strong clusters, along with a few less prominent ones.
2.8. Conclusion¶
This concludes the section on matrix factorization models. We have successfully built a recommender system using matrix factorization. Initially the recommendations were weak: the model was poor at identifying and ignoring irrelevant artists. We addressed this and improved the quality of recommendations through regularization. By inspecting the embeddings and experimenting with similarity measures, I have concluded that the dot product is more appropriate for this use case and produces better suggestions. However, cosine similarity is better at capturing specific user interests and could be used to help users discover new, lesser-known artists.
3. Softmax model¶
In this section, we will train a simple softmax model that predicts whether a given user has listened to an artist. The model takes as input a feature vector \(x\) representing the list of artists the user has listened to. Softmax, sometimes referred to as multinomial logistic regression, is a form of logistic regression: it treats the problem as a multiclass prediction problem and calculates the probability that a user has listened to a given artist.
3.1. Outline¶
Batch Generation
Loss Function
Build, Train, Inspect
3.2. Create DataFrame¶
listened_artists = (listens[["userID", "artistID"]]
.groupby("userID", as_index=False)
.aggregate(lambda x: list(x.apply(str))))
listened_artists.userID = listened_artists.userID.astype('str')
listened_artists.head()
| userID | artistID | |
|---|---|---|
| 0 | 0 | [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 5... |
| 1 | 1 | [95, 96, 97, 98, 99, 100, 101, 102, 103, 104, ... |
| 2 | 10 | [66, 183, 185, 224, 282, 294, 327, 338, 371, 3... |
| 3 | 100 | [597, 610, 735, 739, 744, 746, 747, 763, 769, ... |
| 4 | 1000 | [49, 50, 58, 59, 61, 65, 83, 251, 282, 283, 28... |
3.3. Batch generation¶
We then create a function that generates an example batch, such that each example contains the following features:
artistID: A tensor of strings of the artist ids that the user listened to.
tag: A tensor of strings of the tags of those artists
year: A tensor of strings of the peak year.
years_dict = {
artist: year for artist, year in zip(artists_df["id"], artists_df["peak_year"])
}
tags_dict = {
artist: tags
for artist, tags in zip(artists_df["id"], artists_df["all_tags"])
}
def make_batch(listens, batch_size):
"""Creates a batch of examples.
Args:
listens: A DataFrame of ratings such that examples["artistID"] is a list of
artists listened to by a user.
batch_size: The batch size.
"""
def pad(x, fill):
return pd.DataFrame.from_dict(x).fillna(fill).values
artist = []
year = []
tag = []
label = []
for artistIDs in listens["artistID"].values:
artist.append(artistIDs)
tag.append([x for artistID in artistIDs for x in tags_dict[artistID]])
year.append([years_dict[artistID] for artistID in artistIDs])
label.append([int(artistID) for artistID in artistIDs])
features = {
"id": pad(artist, ""),
"peak_year": pad(year, ""),
"tag_1": pad(tag, ""),
"label": pad(label, -1)
}
print('making batch')
global tmp
tmp = features
batch = (
tf.data.Dataset.from_tensor_slices(features)
.shuffle(1000)
.repeat()
.batch(batch_size)
.make_one_shot_iterator()
.get_next())
return batch
def select_random(x):
"""Selectes a random elements from each row of x."""
def to_float(x):
return tf.cast(x, tf.float32)
def to_int(x):
return tf.cast(x, tf.int64)
batch_size = tf.shape(x)[0]
rn = tf.range(batch_size)
nnz = to_float(tf.count_nonzero(x >= 0, axis=1))
rnd = tf.random_uniform([batch_size])
ids = tf.stack([to_int(rn), to_int(nnz * rnd)], axis=1)
return to_int(tf.gather_nd(x, ids))
3.4. Loss function¶
The softmax model maps the input features \(x\) to a user embedding \(\psi(x) \in \mathbb R^d\), where \(d\) is the embedding dimension. This vector is then multiplied by an artist embedding matrix \(V \in \mathbb R^{m \times d}\) (where \(m\) is the number of artists), and the final output of the model is the softmax of the product
\[
\hat p(x) = \text{softmax}(\psi(x) V^\top).
\]
Given a target label \(y\), if we denote by \(p = 1_y\) a one-hot encoding of this target label, then the loss is the cross-entropy between \(\hat p(x)\) and \(p\).
We will write a function that takes tensors representing the user embeddings \(\psi(x)\), the artist embeddings \(V\), and the target label \(y\), and returns the cross-entropy loss.
def softmax_loss(user_embeddings, artist_embeddings, labels):
"""Returns the cross-entropy loss of the softmax model.
Args:
user_embeddings: A tensor of shape [batch_size, embedding_dim].
artist_embeddings: A tensor of shape [num_artists, embedding_dim].
labels: A tensor of [batch_size], such that labels[i] is the target label
for example i.
Returns:
The mean cross-entropy loss.
"""
# Verify that the embeddings have compatible dimensions
user_emb_dim = user_embeddings.shape[1]
artist_emb_dim = artist_embeddings.shape[1]
if user_emb_dim != artist_emb_dim:
raise ValueError(
"The user embedding dimension %d should match the artist embedding "
"dimension %d" % (user_emb_dim, artist_emb_dim))
logits = tf.matmul(user_embeddings, artist_embeddings, transpose_b=True)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=logits, labels=labels))
return loss
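The cross-entropy that this loss reduces to (mean over the batch, with integer class labels) can be sketched in NumPy; the helper name and toy logits below are invented for the example:

```python
import numpy as np

def sparse_softmax_xent(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    # Shift by the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the true class for each example.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.0, 0.0],   # example 0 strongly favours class 0
                   [0.0, 0.0, 0.0]])  # example 1 is uniform over 3 classes
labels = np.array([0, 2])
loss = sparse_softmax_xent(logits, labels)
```

For the uniform second row the per-example loss is exactly \(\log 3\), a handy sanity check when debugging the TensorFlow version.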
3.5. Build a softmax model, train it, and inspect its embeddings.¶
We are now ready to build a softmax CFModel. The architecture of the model is defined in the function create_user_embeddings and illustrated in the figure below. The input embeddings (artistID, tag_1 and peak_year) are concatenated to form the input layer, then we have hidden layers with dimensions specified by the hidden_dims argument. Finally, the last hidden layer is multiplied by the artist embeddings to obtain the logits layer. For the target label, we will use a randomly-sampled artistID from the list of artists the user has listened to.

3.5.1. Build¶
def build_softmax_model(listened_artists, embedding_cols, hidden_dims):
"""Builds a Softmax model for lastfm.
Args:
listened_artists: DataFrame of training examples.
embedding_cols: A dictionary mapping feature names (string) to embedding
column objects. This will be used in tf.feature_column.input_layer() to
create the input layer.
hidden_dims: int list of the dimensions of the hidden layers.
Returns:
A CFModel object.
"""
def create_network(features):
"""Maps input features dictionary to user embeddings.
Args:
features: A dictionary of input string tensors.
Returns:
outputs: A tensor of shape [batch_size, embedding_dim].
"""
# Create a bag-of-words embedding for each sparse feature.
inputs = tf.feature_column.input_layer(features, embedding_cols)
# Hidden layers.
input_dim = inputs.shape[1]
for i, output_dim in enumerate(hidden_dims):
w = tf.get_variable(
"hidden%d_w_" % i, shape=[input_dim, output_dim],
initializer=tf.truncated_normal_initializer(
stddev=1./np.sqrt(output_dim))) / 10.
outputs = tf.matmul(inputs, w)
input_dim = output_dim
inputs = outputs
return outputs
train_listened_artists, test_listened_artists = split_dataframe(listened_artists)
train_batch = make_batch(train_listened_artists, 200)
test_batch = make_batch(test_listened_artists, 100)
with tf.variable_scope("model", reuse=False):
# Train
train_user_embeddings = create_network(train_batch)
train_labels = select_random(train_batch["label"])
with tf.variable_scope("model", reuse=True):
# Test
test_user_embeddings = create_network(test_batch)
test_labels = select_random(test_batch["label"])
artist_embeddings = tf.get_variable(
"input_layer/id_embedding/embedding_weights")
test_loss = softmax_loss(
test_user_embeddings, artist_embeddings, test_labels)
train_loss = softmax_loss(
train_user_embeddings, artist_embeddings, train_labels)
_, test_precision_at_10 = tf.metrics.precision_at_k(
labels=test_labels,
predictions=tf.matmul(test_user_embeddings, artist_embeddings, transpose_b=True),
k=10)
metrics = (
{"train_loss": train_loss, "test_loss": test_loss},
{"test_precision_at_10": test_precision_at_10}
)
embeddings = {"artistID": artist_embeddings}
return CFModel(embeddings, train_loss, metrics)
3.5.2. Train¶
We are now ready to train the softmax model. The following hyperparameters can be set:
learning rate
number of iterations (note: you can run softmax_model.train() again to continue training the model from its current state)
input embedding dimensions (the input_dims argument)
number of hidden layers and size of each layer (the hidden_dims argument)
Note: since our input features are string-valued (artistID, tag_1, and peak_year), we need to map them to integer ids. This is done using tf.feature_column.categorical_column_with_vocabulary_list, which takes a vocabulary list specifying all the values the feature can take. Then each id is mapped to an embedding vector using tf.feature_column.embedding_column.
# Create feature embedding columns
def make_embedding_col(key, embedding_dim):
  categorical_col = tf.feature_column.categorical_column_with_vocabulary_list(
      key=key, vocabulary_list=list(set(artists_df[key].values)), num_oov_buckets=0)
  return tf.feature_column.embedding_column(
      categorical_column=categorical_col, dimension=embedding_dim,
      # default initializer: truncated normal with stddev=1/sqrt(dimension)
      combiner='mean')
with tf.Graph().as_default():
  softmax_model = build_softmax_model(
      listened_artists,
      embedding_cols=[
          make_embedding_col("id", 35),
          # make_embedding_col("tag", 3),
          # make_embedding_col("peak_year", 2),
      ],
      hidden_dims=[35])
making batch
softmax_model.train(
    learning_rate=8., num_iterations=3000, optimizer=tf.train.AdagradOptimizer)
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/adagrad.py:143: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
iteration 3000: train_loss=6.966783, test_loss=7.547706, test_precision_at_10=0.007220
({'test_loss': 7.5477057, 'train_loss': 6.966783},
{'test_precision_at_10': 0.007219926691102965})
The train loss is higher than the loss seen in previous models. Precision does improve with more training; however, it remains low, reaching a maximum value of 0.0072. Precision for recommender systems is generally low because we are predicting the items a user might be interested in out of a very large set of items. It is also hard to tell whether a user would actually be interested in an item if it is never presented to them as an option.
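To make the metric concrete, precision at k simply measures what fraction of the top-k recommended items appear in the user's held-out relevant items. A minimal sketch with made-up ids (not taken from this model's output):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Toy example: of the 10 recommended artist ids, only one (3) is in the
# user's held-out set, so precision@10 = 0.1.
recommended = list(range(10))   # hypothetical ranked artist ids
relevant = {3, 104, 2071}       # hypothetical held-out artist ids
print(precision_at_k(recommended, relevant, k=10))  # 0.1
```

With thousands of candidate artists and only a handful of held-out listens per user, values well below 0.1 are unsurprising.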
3.5.3. Inspect Embeddings¶
We can inspect the artist embeddings as we did for the previous models. Note that in this case, the artist embeddings are used at the same time as input embeddings (for the bag of words representation of the user listening history), and as softmax weights.
artist_neighbors(softmax_model, "Coldplay", DOT)
artist_neighbors(softmax_model, "Coldplay", COSINE)
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| | dot score | names |
|---|---|---|
| 59 | 36.792 | Coldplay |
| 223 | 33.686 | The Killers |
| 184 | 32.686 | Muse |
| 527 | 31.506 | Oasis |
| 214 | 31.132 | Red Hot Chili Peppers |
| 201 | 29.505 | Arctic Monkeys |
Nearest neighbors of : Coldplay.
[Found more than one matching artist. Other candidates: Jay-Z & Coldplay, Coldplay/U2]
| | cosine score | names |
|---|---|---|
| 59 | 1.000 | Coldplay |
| 1366 | 0.950 | Snow Patrol |
| 223 | 0.939 | The Killers |
| 310 | 0.925 | Alanis Morissette |
| 527 | 0.923 | Oasis |
| 165 | 0.913 | Stereophonics |
3.6. Conclusion¶
These recommendations are highly relevant. Although the loss is higher, in my opinion the recommendations are superior to those we were receiving from the previous matrix factorization models. We have expanded on our previous work by building a softmax model that is capable of making relevant, high-quality recommendations.
4. LightGCN¶
Graphs are versatile data structures that can model complex elements and relationships. In this chapter I implement a Light Graph Convolution Network (LightGCN) to make recommendations. This work utilises a recommender library developed by Microsoft; instructions on installation can be found here. The library provides utilities to aid common recommendation-building tasks such as data cleaning, test/train splitting, and the implementation of algorithms.
4.1. Outline¶
Overview of LightGCN
Prepare data and hyper-parameters
Create and train model
Recommendations and evaluation
4.2. LightGCN Overview & Architecture¶
Graph Convolution Networks (GCNs) perform semi-supervised learning on graph-structured data. Many real-world datasets come in the form of property graphs, yet until recently little effort had been devoted to generalising neural network models to graph-structured data. GCNs are based on an efficient variant of convolutional neural networks. The convolutional architecture allows them to scale linearly in the number of graph edges and to learn hidden layer representations of the nodes.
LightGCN is a simplified design of GCN, more concise and appropriate for recommenders. The model architecture is illustrated below.
In Light Graph Convolution, only the normalized sum of neighbour embeddings is passed on to the next layer; other operations such as self-connection, feature transformation, and nonlinear activation are all removed, which greatly simplifies GCNs. In Layer Combination, the embeddings obtained at each layer are combined via a weighted sum to form the final representations.
4.2.1. Light Graph Convolution (LGC)¶
In LightGCN, a simple weighted sum aggregator is utilised. The graph convolution operation in LightGCN is defined as:

\[
\mathbf{e}_{u}^{(k+1)}=\sum_{i \in \mathcal{N}_{u}} \frac{1}{\sqrt{\left|\mathcal{N}_{u}\right|} \sqrt{\left|\mathcal{N}_{i}\right|}} \mathbf{e}_{i}^{(k)}, \qquad
\mathbf{e}_{i}^{(k+1)}=\sum_{u \in \mathcal{N}_{i}} \frac{1}{\sqrt{\left|\mathcal{N}_{i}\right|} \sqrt{\left|\mathcal{N}_{u}\right|}} \mathbf{e}_{u}^{(k)}
\]
The symmetric normalization term \(\frac{1}{\sqrt{\left|\mathcal{N}_{u}\right|} \sqrt{\left|\mathcal{N}_{i}\right|}}\) follows the design of standard GCN, which can avoid the scale of embeddings increasing with graph convolution operations.
4.2.2. Layer Combination and Model Prediction¶
The embeddings at the 0-th layer are the only trainable parameters, i.e., \(\mathbf{e}_{u}^{(0)}\) for all users and \(\mathbf{e}_{i}^{(0)}\) for all items. After \(K\) layers, the embeddings obtained at each layer are combined to arrive at the final representation of a user (an item):

\[
\mathbf{e}_{u}=\sum_{k=0}^{K} \alpha_{k} \mathbf{e}_{u}^{(k)}, \qquad
\mathbf{e}_{i}=\sum_{k=0}^{K} \alpha_{k} \mathbf{e}_{i}^{(k)}
\]
where \(\alpha_{k} \geq 0\) denotes the importance of the \(k\)-th layer embedding in constituting the final embedding. In our experiments, we set \(\alpha_{k}\) uniformly as \(1 / (K+1)\).
The model prediction is defined as the inner product of the user and item final representations:

\[
\hat{y}_{u i}=\mathbf{e}_{u}^{T} \mathbf{e}_{i}
\]
which is used as the ranking score for recommendation generation.
4.2.3. Matrix Form¶
Let the user-item interaction matrix be \(\mathbf{R} \in \mathbb{R}^{M \times N}\), where \(M\) and \(N\) denote the number of users and items respectively, and each entry \(R_{ui}\) is 1 if \(u\) has interacted with item \(i\) and 0 otherwise. The adjacency matrix of the user-item graph is then

\[
\mathbf{A}=\left(\begin{array}{cc}\mathbf{0} & \mathbf{R} \\ \mathbf{R}^{T} & \mathbf{0}\end{array}\right)
\]
Let the 0-th layer embedding matrix be \(\mathbf{E}^{(0)} \in \mathbb{R}^{(M+N) \times T}\), where \(T\) is the embedding size. Then we can obtain the matrix equivalent form of LGC as:

\[
\mathbf{E}^{(k+1)}=\left(\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}\right) \mathbf{E}^{(k)}
\]
where \(\mathbf{D}\) is an \((M+N) \times(M+N)\) diagonal matrix, in which each entry \(D_{ii}\) denotes the number of nonzero entries in the \(i\)-th row vector of the adjacency matrix \(\mathbf{A}\) (\(\mathbf{D}\) is also known as the degree matrix). Lastly, we get the final embedding matrix used for model prediction as:

\[
\mathbf{E}=\alpha_{0} \mathbf{E}^{(0)}+\alpha_{1} \tilde{\mathbf{A}} \mathbf{E}^{(0)}+\alpha_{2} \tilde{\mathbf{A}}^{2} \mathbf{E}^{(0)}+\cdots+\alpha_{K} \tilde{\mathbf{A}}^{K} \mathbf{E}^{(0)}
\]
where \(\tilde{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}\) is the symmetrically normalized matrix.
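The matrix form above can be sketched in a few lines of NumPy. This is an illustrative toy (a made-up interaction matrix \(\mathbf{R}\) and random 0-th layer embeddings), not the library's implementation:

```python
import numpy as np

M, N, T, K = 3, 4, 2, 2          # users, items, embedding size, layers
rng = np.random.default_rng(0)

# Toy binary interaction matrix R (M x N).
R = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)

# Adjacency matrix A of the user-item graph.
A = np.block([[np.zeros((M, M)), R],
              [R.T, np.zeros((N, N))]])

# Symmetrically normalised adjacency: A_tilde = D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
A_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# 0-th layer embeddings are the only trainable parameters.
E0 = rng.normal(size=(M + N, T))

# Propagate K layers and combine with uniform alpha_k = 1/(K+1).
E, layer = E0 / (K + 1), E0
for _ in range(K):
    layer = A_tilde @ layer       # E^{(k+1)} = A_tilde E^{(k)}
    E += layer / (K + 1)

# Ranking scores: inner products of user and item final embeddings.
scores = E[:M] @ E[M:].T          # shape (M, N)
print(scores.shape)
```

In training, only `E0` would be updated by gradient descent; the propagation itself has no parameters, which is what makes LightGCN "light".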
4.2.4. Model Training¶
Bayesian Personalized Ranking (BPR) loss is used. BPR is a pairwise loss that encourages the prediction of an observed entry to be higher than those of its unobserved counterparts:

\[
L_{B P R}=-\sum_{u=1}^{M} \sum_{i \in \mathcal{N}_{u}} \sum_{j \notin \mathcal{N}_{u}} \ln \sigma\left(\hat{y}_{u i}-\hat{y}_{u j}\right)+\lambda\left\|\mathbf{E}^{(0)}\right\|^{2}
\]
Where \(\lambda\) controls the \(L_2\) regularization strength.
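A minimal sketch of this loss over a batch of sampled (user, positive item, negative item) triples, using hypothetical scores rather than the library's implementation:

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores, emb_0, lam=1e-4):
    """Pairwise BPR loss: push observed scores above unobserved ones,
    with L2 regularisation on the 0-th layer embeddings."""
    sigmoid = 1.0 / (1.0 + np.exp(-(pos_scores - neg_scores)))
    return -np.sum(np.log(sigmoid)) + lam * np.sum(emb_0 ** 2)

# Toy scores for three sampled (u, i, j) triples.
pos = np.array([2.0, 1.5, 0.3])   # y_hat_ui for observed items
neg = np.array([0.5, 1.0, 0.4])   # y_hat_uj for unobserved items
emb_0 = np.zeros((5, 2))          # dummy 0-th layer embeddings
loss = bpr_loss(pos, neg, emb_0)
print(round(loss, 4))
```

Note the third triple, where the negative item outscores the positive one, contributes the largest share of the loss.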
4.3. Import required packages¶
import sys
import os
import papermill as pm
import scrapbook as sb
import pandas as pd
import numpy as np
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages
from recommenders.utils.timer import Timer
from recommenders.models.deeprec.models.graphrec.lightgcn import LightGCN
from recommenders.models.deeprec.DataModel.ImplicitCF import ImplicitCF
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.utils.constants import SEED as DEFAULT_SEED
from recommenders.models.deeprec.deeprec_utils import prepare_hparams
4.4. Read in Data & Set Parameters¶
listens = pd.read_csv('.\\data\\processed\\listens.csv',index_col=0)
artists = pd.read_csv('.\\data\\processed\\artists.csv',index_col=0)
artist_dict = pd.Series(artists.name,index=artists.id).to_dict()
listens.head(3)
| | userID | artistID | listenCount |
|---|---|---|---|
| 0 | 0 | 45 | 3.047442 |
| 1 | 0 | 46 | 3.047442 |
| 2 | 0 | 47 | 3.047442 |
# top k items to recommend
TOP_K = 10
LISTENS_DATA_SIZE = '100k'
# Model parameters
EPOCHS = 50
BATCH_SIZE = 1024
SEED = DEFAULT_SEED # Set None for non-deterministic results
yaml_file = "./lightgcn.yaml"
4.5. LightGCN Implementation¶
4.5.1. Split Data¶
We split the full dataset into train and test sets to evaluate the performance of the algorithm against a held-out set not seen during training. Because LightGCN generates recommendations based on user preferences, all users in the test set must also exist in the training set. We can use the provided python_stratified_split function, which holds out a percentage of items from each user but ensures all users appear in both the train and test datasets. We will use a 75/25 train/test split. I considered keeping the split consistent with the matrix factorization and softmax models; however, this method relies heavily on users' historic listening records and is split in a different manner, so I decided against it.
df = listens
df = df.rename(columns={'listenCount': 'rating', 'artistID':'itemID'})
# listens['timestamp'] = np.nan
df.head()
| | userID | itemID | rating |
|---|---|---|---|
| 0 | 0 | 45 | 3.047442 |
| 1 | 0 | 46 | 3.047442 |
| 2 | 0 | 47 | 3.047442 |
| 3 | 0 | 48 | 3.047442 |
| 4 | 0 | 49 | 3.047442 |
train, test = python_stratified_split(df, ratio=0.75)
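For intuition, a hand-rolled equivalent of this per-user holdout might look like the sketch below (the library function also handles filtering and other options; this toy keeps only the core idea):

```python
import pandas as pd

def stratified_split(df, ratio=0.75, seed=42):
    """Hold out (1 - ratio) of each user's rows, keeping every user in both sets."""
    train_parts, test_parts = [], []
    for _, user_rows in df.groupby('userID'):
        shuffled = user_rows.sample(frac=1.0, random_state=seed)
        cut = max(1, int(len(shuffled) * ratio))   # keep at least one row in train
        train_parts.append(shuffled.iloc[:cut])
        test_parts.append(shuffled.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)

# Toy interactions: two users with four items each -> 3 train / 1 test per user.
toy = pd.DataFrame({'userID': [0] * 4 + [1] * 4,
                    'itemID': list(range(4)) * 2,
                    'rating': [1.0] * 8})
tr, te = stratified_split(toy)
print(len(tr), len(te))  # 6 2
```

Splitting per user, rather than over the whole DataFrame, is what guarantees no cold-start users at evaluation time.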
4.5.2. Process data¶
ImplicitCF is a class that initializes and loads data for the training process. During the initialization of this class, user IDs and item IDs are reindexed, ratings greater than zero are converted into implicit positive interactions, and an adjacency matrix of the user-item graph is created.
data = ImplicitCF(train=train, test=test, seed=SEED)
4.5.3. Prepare hyper-parameters¶
Parameters can be set for the LightGCN model. To save time on tuning, we will use the prepared parameters found in yaml_file. prepare_hparams reads in the yaml file and prepares a full set of parameters for the model.
hparams = prepare_hparams(yaml_file,
                          n_layers=3,
                          batch_size=BATCH_SIZE,
                          epochs=EPOCHS,
                          learning_rate=0.005,
                          eval_epoch=5,
                          top_k=TOP_K,
                          )
4.5.4. Create and train model¶
With data and parameters prepared, we can create and train the LightGCN model.
model = LightGCN(hparams, data, seed=SEED)
Already create adjacency matrix.
Already normalize adjacency matrix.
Using xavier initialization.
with Timer() as train_time:
    model.fit()

print("Took {} seconds for training.".format(train_time.interval))
Epoch 1 (train)9.2s: train loss = 0.42353 = (mf)0.42337 + (embed)0.00016
Epoch 2 (train)9.1s: train loss = 0.19883 = (mf)0.19836 + (embed)0.00047
Epoch 3 (train)8.4s: train loss = 0.15302 = (mf)0.15242 + (embed)0.00059
Epoch 4 (train)9.4s: train loss = 0.12323 = (mf)0.12253 + (embed)0.00070
Epoch 5 (train)8.7s + (eval)1.1s: train loss = 0.10586 = (mf)0.10505 + (embed)0.00080, recall = 0.09133, ndcg = 0.11955, precision = 0.10797, map = 0.04692
Epoch 6 (train)9.0s: train loss = 0.09377 = (mf)0.09288 + (embed)0.00089
Epoch 7 (train)9.1s: train loss = 0.08220 = (mf)0.08122 + (embed)0.00099
Epoch 8 (train)8.5s: train loss = 0.07451 = (mf)0.07344 + (embed)0.00107
Epoch 9 (train)8.6s: train loss = 0.06745 = (mf)0.06629 + (embed)0.00116
Epoch 10 (train)8.7s + (eval)0.9s: train loss = 0.05959 = (mf)0.05835 + (embed)0.00124, recall = 0.11003, ndcg = 0.14639, precision = 0.13027, map = 0.05780
Epoch 11 (train)9.1s: train loss = 0.05491 = (mf)0.05359 + (embed)0.00132
Epoch 12 (train)8.5s: train loss = 0.04991 = (mf)0.04851 + (embed)0.00139
Epoch 13 (train)8.7s: train loss = 0.04857 = (mf)0.04710 + (embed)0.00147
Epoch 14 (train)8.7s: train loss = 0.04412 = (mf)0.04257 + (embed)0.00154
Epoch 15 (train)8.7s + (eval)0.9s: train loss = 0.04175 = (mf)0.04014 + (embed)0.00161, recall = 0.12334, ndcg = 0.16331, precision = 0.14599, map = 0.06553
Epoch 16 (train)8.9s: train loss = 0.03916 = (mf)0.03748 + (embed)0.00168
Epoch 17 (train)9.0s: train loss = 0.03575 = (mf)0.03400 + (embed)0.00175
Epoch 18 (train)9.0s: train loss = 0.03453 = (mf)0.03272 + (embed)0.00182
Epoch 19 (train)8.6s: train loss = 0.03400 = (mf)0.03212 + (embed)0.00188
Epoch 20 (train)9.2s + (eval)1.0s: train loss = 0.03236 = (mf)0.03042 + (embed)0.00194, recall = 0.13128, ndcg = 0.17543, precision = 0.15550, map = 0.07098
Epoch 21 (train)9.1s: train loss = 0.03115 = (mf)0.02916 + (embed)0.00199
Epoch 22 (train)9.4s: train loss = 0.02957 = (mf)0.02751 + (embed)0.00205
Epoch 23 (train)9.0s: train loss = 0.02818 = (mf)0.02608 + (embed)0.00211
Epoch 24 (train)8.7s: train loss = 0.02638 = (mf)0.02422 + (embed)0.00216
Epoch 25 (train)8.9s + (eval)0.9s: train loss = 0.02666 = (mf)0.02445 + (embed)0.00221, recall = 0.13709, ndcg = 0.18341, precision = 0.16240, map = 0.07455
Epoch 26 (train)9.2s: train loss = 0.02509 = (mf)0.02282 + (embed)0.00227
Epoch 27 (train)8.9s: train loss = 0.02325 = (mf)0.02093 + (embed)0.00232
Epoch 28 (train)8.5s: train loss = 0.02202 = (mf)0.01965 + (embed)0.00237
Epoch 29 (train)9.0s: train loss = 0.02138 = (mf)0.01896 + (embed)0.00243
Epoch 30 (train)9.0s + (eval)0.9s: train loss = 0.02225 = (mf)0.01978 + (embed)0.00248, recall = 0.14030, ndcg = 0.18926, precision = 0.16612, map = 0.07737
Epoch 31 (train)8.8s: train loss = 0.02231 = (mf)0.01980 + (embed)0.00251
Epoch 32 (train)8.7s: train loss = 0.02038 = (mf)0.01781 + (embed)0.00256
Epoch 33 (train)8.7s: train loss = 0.02028 = (mf)0.01766 + (embed)0.00261
Epoch 34 (train)8.7s: train loss = 0.01879 = (mf)0.01614 + (embed)0.00266
Epoch 35 (train)8.5s + (eval)0.9s: train loss = 0.01845 = (mf)0.01575 + (embed)0.00271, recall = 0.14525, ndcg = 0.19626, precision = 0.17180, map = 0.08061
Epoch 36 (train)9.0s: train loss = 0.01845 = (mf)0.01569 + (embed)0.00275
Epoch 37 (train)8.8s: train loss = 0.01814 = (mf)0.01535 + (embed)0.00279
Epoch 38 (train)8.7s: train loss = 0.01757 = (mf)0.01474 + (embed)0.00283
Epoch 39 (train)8.8s: train loss = 0.01699 = (mf)0.01412 + (embed)0.00287
Epoch 40 (train)8.4s + (eval)0.9s: train loss = 0.01673 = (mf)0.01382 + (embed)0.00291, recall = 0.14834, ndcg = 0.20013, precision = 0.17515, map = 0.08211
Epoch 41 (train)9.8s: train loss = 0.01574 = (mf)0.01279 + (embed)0.00295
Epoch 42 (train)8.5s: train loss = 0.01637 = (mf)0.01339 + (embed)0.00298
Epoch 43 (train)8.8s: train loss = 0.01714 = (mf)0.01412 + (embed)0.00302
Epoch 44 (train)8.9s: train loss = 0.01568 = (mf)0.01262 + (embed)0.00305
Epoch 45 (train)8.9s + (eval)1.1s: train loss = 0.01618 = (mf)0.01309 + (embed)0.00309, recall = 0.15104, ndcg = 0.20393, precision = 0.17870, map = 0.08413
Epoch 46 (train)9.4s: train loss = 0.01409 = (mf)0.01097 + (embed)0.00313
Epoch 47 (train)8.8s: train loss = 0.01500 = (mf)0.01183 + (embed)0.00316
Epoch 48 (train)9.5s: train loss = 0.01427 = (mf)0.01107 + (embed)0.00319
Epoch 49 (train)8.3s: train loss = 0.01427 = (mf)0.01104 + (embed)0.00323
Epoch 50 (train)10.5s + (eval)1.2s: train loss = 0.01370 = (mf)0.01044 + (embed)0.00326, recall = 0.15244, ndcg = 0.20657, precision = 0.18019, map = 0.08504
Took 455.2403181999998 seconds for training.
4.5.5. Recommendations¶
recommend_k_items produces k artist recommendations for each user passed to the function. remove_seen=True removes the artists already listened to by the user. We will produce recommendations using the trained model on instances from the test set as input.
topk_scores = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)
top_scores = topk_scores
top_scores['name'] = topk_scores.itemID.map(artist_dict)
top_scores.head()
| | userID | itemID | prediction | name |
|---|---|---|---|---|
| 0 | 0 | 992 | 12.571548 | Pet Shop Boys |
| 1 | 0 | 151 | 11.216814 | Michael Jackson |
| 2 | 0 | 181 | 11.112077 | a-ha |
| 3 | 0 | 593 | 10.635868 | David Bowie |
| 4 | 0 | 1005 | 10.495359 | Erasure |
def user_recommendations(user):
    listened_to = train[train.userID == user].sort_values('rating', ascending=False)
    listened_to['name'] = listened_to.itemID.map(artist_dict)
    listened_to = listened_to.head(10).name
    print('User ' + str(user) + ' most listened to artists...')
    print('\n'.join(listened_to) + '\n')
    topk_scores_recs = topk_scores[topk_scores.userID == user].sort_values('prediction', ascending=False).name
    print('User ' + str(user) + ' recommendations...')
    print('\n'.join(topk_scores_recs.tolist()))
    return
user_recommendations(user=500)
User 500 most listened to artists...
Christina Aguilera
John Mayer
Chico Buarque
Sarah Brightman
Oasis
The Beatles
Lady Gaga
Adele
Justin Timberlake
Paul McCartney
User 500 recommendations...
Britney Spears
Beyoncé
Kylie Minogue
P!nk
Coldplay
Amy Winehouse
Black Eyed Peas
Mariah Carey
Ke$ha
Kelly Clarkson
user_recommendations(user=300)
User 300 most listened to artists...
Van Halen
KISS
Iron Maiden
Black Sabbath
Leaves' Eyes
Epica
The Agonist
Five Finger Death Punch
AC/DC
Deadstar Assembly
User 300 recommendations...
System of a Down
Metallica
Korn
In Flames
Megadeth
Rammstein
Bullet for My Valentine
HIM
Judas Priest
Pantera
At a glance, the recommendation system appears to work extremely well. User 500 has pretty broad and generic music tastes, yet each recommended artist makes sense. User 300 appears to have more specific music interests. Most of user 300's most listened to artists are rock/heavy metal bands from the 70s/80s, and the recommendations are also mainly rock/heavy metal bands from the same time period. Across both users, all recommendations appear relevant and potentially useful.
4.5.6. Evaluation¶
With topk_scores (k=10) predicted by the model, we can evaluate how LightGCN performs on the test set. We will use four evaluation metrics:
Mean Average Precision (MAP)
Normalized Discounted Cumulative Gain (NDCG)
Precision at 10
Recall at 10
eval_map = map_at_k(test, topk_scores, k=TOP_K)
eval_ndcg = ndcg_at_k(test, topk_scores, k=TOP_K)
eval_precision = precision_at_k(test, topk_scores, k=TOP_K)
eval_recall = recall_at_k(test, topk_scores, k=TOP_K)
print("MAP:\t%f" % eval_map,
"NDCG:\t%f" % eval_ndcg,
"Precision@K:\t%f" % eval_precision,
"Recall@K:\t%f" % eval_recall, sep='\n')
MAP: 0.023575
NDCG: 0.082716
Precision@K: 0.087785
Recall@K: 0.074625
These results are promising, and they back up the impression from the two users' recommendations that the model works well. Although the test split was different from the splits used to evaluate the matrix factorization and softmax models, this model's precision is still almost 10 times higher. It appears that this is the superior recommendation system and that we have managed to beat the standard set by the initial matrix factorization model.
4.6. Conclusion¶
LightGCN is a lightweight and efficient form of GCN that can be quickly built, trained, and evaluated on this dataset without the need for a GPU. Even without tuning the hyperparameters, the results and recommendations produced by this model are impressive. Here, we have produced a relevant and potentially useful artist recommendation system. The recommenders library was also extremely useful and appropriate for our objective of building an artist recommender system using our Last.fm dataset.
5. Content-Based Recommender¶
This section involves building a simple content-based recommender system for artists. Content-based filtering involves making recommendations based on item features. For this dataset, we have limited item features: only the taggings made by the users. Last.fm allows users to tag artists with keywords, and our dataset contains this user-artist tagging information. These keywords can be anything, but are usually reflective of genre or sentiment.
The recommender built below takes an artist as input and outputs a list of similar artists based only on the tags received by each artist. There are too many tags to one-hot encode them all. Instead, I decided on a TF-IDF approach to convert the set of tags for each artist into numerical values. After performing TF-IDF vectorization, I was able to compute a cosine similarity matrix. The recommender system works by simply recommending the artists with the highest similarity scores to the input artist.
5.1. Outline¶
TF-IDF of Tag List
Calculate Similarity Matrix
Recommender
Sanity Check
5.2. Import required packages¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import warnings
warnings.filterwarnings('ignore')
5.3. Read in data¶
artist_features = pd.read_csv('..\\data\\processed\\artist_features.csv',index_col=0)
Filter out artists with no tags
This method finds similarity based on tags. Therefore, artists that have not been tagged at all cannot be compared using tags. We will filter out all artists that have no tags.
artist_features = artist_features[artist_features['peak_year'].notna()]
Convert all_tags feature to string
all_tags is currently stored as a list of strings. Let's convert this into a space-separated string.
artist_features.all_tags = artist_features.all_tags.apply(lambda x: ' '.join(literal_eval(x)))
artist_features.all_tags.head(3)
0 weeabo jrock visualkei betterthanladygaga goth...
1 german seenlive darkwave industrial gothic amb...
2 blackmetal norwegianblackmetal truenorwegianbl...
Name: all_tags, dtype: object
5.4. TF-IDF for tags¶
Term Frequency-Inverse Document Frequency (TF-IDF) is a technique that can be used to quantify textual data. Term frequency measures the frequency of a word in a document. Inverse document frequency measures how common or rare a word is across the entire dataset. When combined, the TF-IDF score increases proportionally to the occurrences of a word in a document, but is offset by the number of documents that contain that word. TF-IDF can therefore be used to compute a score that is indicative of the importance of each word within the document and the corpus.
We will apply TF-IDF vectorization to all_tags. This will transform the all_tags column to numerical data. Then we will be able to compare the values for different artists and calculate their similarity based on some form of a similarity score.
tfidf = TfidfVectorizer()
#Construct the required TF-IDF matrix by applying the fit_transform method on the all_tags feature
all_tags_matrix = tfidf.fit_transform(artist_features['all_tags'])
#Output the shape of tfidf_matrix
all_tags_matrix.shape
# Map Matrix Index to Artist Name
mapping = pd.Series(artist_features.index,index = artist_features['name'])
all_tags_matrix
<12133x9396 sparse matrix of type '<class 'numpy.float64'>'
with 106845 stored elements in Compressed Sparse Row format>
5.5. Similarity Matrix¶
We now have a TF-IDF feature matrix for all of the artists. Every artist has 9,396 features (tag words). To find the similarity between the artists, we will use cosine similarity. The linear_kernel function will compute the cosine similarity for us: TfidfVectorizer L2-normalises each row by default, so the plain dot products computed by linear_kernel are equal to cosine similarities, and are faster to compute on sparse input.
similarity_matrix = linear_kernel(all_tags_matrix,all_tags_matrix)
similarity_df = pd.DataFrame(similarity_matrix, columns = artist_features['name'], index = artist_features['name'])
similarity_df.head(3)
| name | MALICE MIZER | Diary of Dreams | Carpathian Forest | Moi dix Mois | Bella Morte | Moonspell | Marilyn Manson | DIR EN GREY | Combichrist | Grendel | ... | Electrosoul System | Nostalgia 77 | The Young Gods | Wiseblood | LOSTFREEQ | Ciccone Youth | Apollo 440 | Die Krupps | Diamanda Galás | Oz Alchemist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| name | |||||||||||||||||||||
| MALICE MIZER | 1.00000 | 0.09213 | 0.0 | 0.436071 | 0.125768 | 0.068878 | 0.029499 | 0.555061 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| Diary of Dreams | 0.09213 | 1.00000 | 0.0 | 0.081005 | 0.395459 | 0.250002 | 0.127415 | 0.095416 | 0.068551 | 0.096065 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.050611 | 0.0 | 0.044474 | 0.166139 | 0.0 | 0.034243 |
| Carpathian Forest | 0.00000 | 0.00000 | 1.0 | 0.000000 | 0.000000 | 0.067779 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
3 rows × 12133 columns
plt.figure(figsize=(12,12))
sns.heatmap(similarity_df)
plt.show()
The heatmap is mostly dark, meaning that most artist pairs have a low similarity between the TF-IDF scores for their tags. However, there are some bright pixels, which illustrates that some artists do have very similar tags. The line of bright pixels going from the top left corner to the bottom right corner captures the similarity between each artist and itself, which is always 1. As this is a large matrix (12,133 × 12,133), it is not possible to visualise it in a manner where we can clearly identify which artists have similar tags.
5.6. Recommender Function¶
Now we will make a recommender function that recommends artists using cosine similarity. The function takes an artist name as input and identifies the most similar artists using the cosine similarity matrix.
def recommend_artists_based_on_tags(artist_input, k):
    artist_index = mapping[artist_input]
    # get similarity values with other artists;
    # similarity_score is a list of (index, similarity) pairs
    similarity_score = list(enumerate(similarity_matrix[artist_index]))
    # sort the similarity scores of the input artist with all other artists in descending order
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # keep the most similar artists, skipping position 0 (the artist itself);
    # note this slice returns k-1 artists
    similarity_score = similarity_score[1:k]
    # return artist names using the mapping series
    artist_indices = [i[0] for i in similarity_score]
    return artist_features['name'].iloc[artist_indices]
recommend_artists_based_on_tags('The Beatles', 10)
1762 Rita Lee
9881 Karyn White
9668 Freddie Jackson
10727 Keith Sweat
2526 Vanessa Carlton
4871 Glenn Medeiros
334 Cher
9885 Patti LaBelle
5619 Devdas
Name: name, dtype: object
The above recommendations for users who like The Beatles do not appear to be particularly relevant. The recommendations are based on artists who have received similar taggings to The Beatles. This approach uses only tag features about the artists; it does not take into account how popular or unpopular the artists are. For this reason, the recommendations are poor and would likely be of little interest to users. Although this method alone would not make a good recommender system, it could still prove useful if combined with a collaborative filtering approach.
5.7. Sanity check similarity scores¶
It is hard to quantify the potential usefulness of this method for a hybrid approach. To determine whether it could add value, we will look at the similarity scores for well-known artist pairs. Before we do that, we will calculate the mean similarity score of the entire similarity matrix. It is important to know this so we can judge whether a particular pair has a relatively high or low similarity score.
sim_mean = similarity_df.to_numpy().mean()
print('Mean similarity score: ', sim_mean)
Mean similarity score: 0.01383527847128692
U2 & Coldplay
similarity_df[similarity_df.index == 'U2']['Coldplay']['U2']
0.22446369776706124
U2 & Lady Gaga
similarity_df[similarity_df.index == 'U2']['Lady Gaga']['U2']
0.08808980506392858
Lady Gaga & Katy Perry
similarity_df[similarity_df.index == 'Katy Perry']['Lady Gaga']['Katy Perry']
0.24031142443152187
Katy Perry & Luciano Pavarotti
similarity_df[similarity_df.index == 'Luciano Pavarotti']['Katy Perry']['Luciano Pavarotti']
0.0
Luciano Pavarotti & Andrea Bocelli
similarity_df[similarity_df.index == 'Andrea Bocelli']['Luciano Pavarotti']['Andrea Bocelli']
0.22851920014637395
Above are the similarity scores for pairs of artists that I would associate as either similar or dissimilar. The results are promising: each artist pair that I would identify as similar has a cosine similarity score above 0.22, while the artist pairs I would label as dissimilar have scores below 0.09. U2 and Coldplay are both European rock bands from the same era, so it is reasonable to think that they received similar taggings. U2 and Lady Gaga are less similar; however, they were both popular around the same time. Lady Gaga and Katy Perry have the highest similarity score of the above examples, which is unsurprising as they are both popular female pop artists from the same time period. Katy Perry and Luciano Pavarotti, the most extreme artist pair I could think of, have a similarity score of 0. Meanwhile, Luciano Pavarotti and Andrea Bocelli, two Italian opera tenors, have a relatively high similarity score of 0.23. While the above examples are not representative of the entire dataset, they indicate that this content-based recommender can identify similar artists based solely on tags.
5.8. Conclusion¶
It is clear that this system on its own would make a poor recommender. However, the tags received by an artist are informative and can be used to identify similar artists. Through TF-IDF we captured information from each unique tag received by an artist, and by keeping all tags and representing them in a vector space we were able to assess the similarity of artists. A content-based similarity matrix such as this one could be used to improve the results of a collaborative filtering recommender system. Unfortunately, due to time constraints, I was unable to integrate this method with the previous collaborative approach to form a hybrid.
6. Conclusion¶
Recommender systems are extremely prevalent and can vary in levels of complexity. Within this project alone, I created four different models, some with several variations. I initially followed a Google Colab tutorial to make a matrix factorization recommender system. Within that, I experimented with different methods including regularization and similarity measures. I then used an open-source library to assist in making a Light Graph Convolution Network recommender, which resulted in highly relevant recommendations, even without parameter tuning. Finally, I made a simple content-based recommender system, which can identify two Italian opera tenors as similar based solely on the tags that they both received. While the content-based model was shown not to be suitable on its own, it could be combined with another system to improve performance.
One of my key takeaways from this project is the importance of data pre-processing. I initially attempted to create the matrix factorization model without resetting the indices and making them consecutive. This was problematic for TensorFlow, and due to the size of the model, debugging took quite a bit of time. Additionally, the decision on how to deal with the weight (listen count) feature was highly impactful for the rest of the work. Given the computational complexity of most machine learning models, if pre-processing is not done correctly, a lot of time can be wasted going back and processing the data properly before retraining the model. This project has highlighted the importance of good pre-processing practices.
A proposal for future work is to make a hybrid collaborative filtering/content-based model. This would perform collaborative filtering to produce a set of potential recommendations, which the content-based model would then rank to arrive at the optimal recommendations. Another area for further work is the relationship data between friends. Due to time constraints, I neglected to use this data. I would like to further examine the similarity between friends, particularly whether they share an interest in certain, potentially niche, genres, and whether harvesting these relationships can improve recommendations.
7. Bibliography¶
- 1
F.O. Isinkaye, Y.O. Folajimi, and B.A. Ojokoh. Recommendation systems: principles, methods and evaluation. Egyptian Informatics Journal, 16(3):261–273, 2015. URL: https://www.sciencedirect.com/science/article/pii/S1110866515000341, doi:10.1016/j.eij.2015.06.005.
- 2
Baptiste Rocca. Introduction to recommender systems. Jun 2019. URL: https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada.